sales-data-analysis/README.md

# Sales Analysis Template

**A best-in-class, reusable template for sales invoice detail analysis**

**Optimized for Cursor AI** - Just ask the AI to create analyses and it handles everything automatically.

This template provides a complete framework for analyzing sales data from any company. It's designed to be:
- **Flexible:** Works with different column names, date formats, and data structures
- **Automated:** Interactive setup wizard configures everything for your company
- **AI-Optimized:** Fully optimized for Cursor - AI knows all patterns and generates code automatically
- **Production-Ready:** Includes error handling, validation, and best practices

---

## 🚀 Quick Start

### 1. Setup (Automated)

Run the interactive setup wizard:

```bash
python setup_wizard.py
```

The wizard will ask you about:
- Company name and analysis date
- Data file location
- Column names in your CSV
- Date range and LTM configuration
- Exclusion filters (if needed)

### 2. Manual Setup (Alternative)

If you prefer to configure manually:

1. **Update `config.py`** with your company-specific settings:
   - `COMPANY_NAME`: Your company name
   - `DATA_FILE`: Your CSV filename
   - `REVENUE_COLUMN`: Your revenue/amount column name
   - `DATE_COLUMN`: Your primary date column
   - Column mappings for Customer, Item, etc.
   - Date range and LTM settings

2. **Place your data file** in the template directory (or update `DATA_DIR` in config.py)

### 3. Test Data Loading

Verify your configuration works:

```bash
python -c "from data_loader import load_sales_data; from config import get_data_path; df = load_sales_data(get_data_path()); print(f'Loaded {len(df):,} rows')"
```

### 4. Create Your First Analysis

Copy the template and customize:

```bash
cp analysis_template.py my_first_analysis.py
# Edit my_first_analysis.py with your analysis logic
python my_first_analysis.py
```

---

## 📁 Project Structure

```
sales_analysis_template/
├── README.md                    # This file
├── QUICK_START.md               # Quick start guide
├── TEMPLATE_OVERVIEW.md         # High-level overview
├── TEMPLATE_SUMMARY.md          # Comprehensive template summary
├── EXAMPLES.md                  # Example scripts guide
├── SETUP_CHECKLIST.md           # Setup verification checklist
├── requirements.txt             # Python dependencies
├── setup_wizard.py              # Interactive setup wizard
│
├── config.py                    # ⭐ Configuration (customize for your company)
├── config_validator.py          # Configuration validation utility
│
├── data_loader.py               # ⭐ Data loading with fallback logic
├── data_quality.py              # Data quality reporting
├── data_processing.py           # Data transformation utilities
│
├── analysis_utils.py           # ⭐ Common utilities (formatters, LTM, helpers)
├── statistical_utils.py         # Statistical analysis utilities
├── validate_revenue.py          # Revenue validation utility
│
├── export_utils.py              # Export to CSV/Excel
├── report_generator.py          # PDF report generation
├── logger_config.py             # Logging configuration
│
├── analysis_template.py         # Template for creating new analyses
├── run_all_analyses.py          # Batch runner for all scripts
├── generate_sample_data.py      # Generate sample data for testing
│
├── examples/                    # Example analysis scripts
│   ├── annual_revenue_trend.py  # Simple annual revenue analysis
│   ├── customer_segmentation.py # RFM customer segmentation
│   ├── cohort_analysis.py       # Customer cohort analysis
│   └── product_performance.py   # Product performance analysis
│
├── tests/                       # Unit tests
│   ├── test_data_loader.py      # Data loader tests
│   ├── test_analysis_utils.py   # Analysis utils tests
│   └── test_config_validator.py # Config validator tests
│
└── .cursor/
    └── rules/                   # Cursor IDE rules (auto-loaded)
        ├── ai_assistant_guide.md # Complete AI assistant guide
        ├── advanced_analysis_patterns.md # Advanced techniques
        ├── analysis_patterns.md  # Common analysis patterns
        ├── chart_formatting.md   # Chart formatting rules
        ├── code_quality.md       # Code quality standards
        ├── common_errors.md      # Error troubleshooting
        ├── data_loading.md       # Data loading patterns
        ├── error_handling.md     # Error handling patterns
        └── ltm_methodology.md    # LTM methodology
```

---

## 🔧 Configuration Guide

### Required Configuration

**In `config.py`, you MUST configure:**

1. **Company Information:**
   ```python
   COMPANY_NAME = "Your Company Name"
   ```

2. **Data File:**
   ```python
   DATA_FILE = 'your_sales_data.csv'
   ```

3. **Column Mappings:**
   ```python
   REVENUE_COLUMN = 'USD'  # Your revenue column name
   DATE_COLUMN = 'InvoiceDate'  # Your date column name
   CUSTOMER_COLUMN = 'Customer'  # Your customer column name
   ```

4. **Date Range:**
   ```python
   MIN_YEAR = 2021
   MAX_DATE = pd.Timestamp('2025-09-30')
   ANALYSIS_YEARS = [2021, 2022, 2023, 2024, 2025]
   ```

### Optional Configuration

**LTM (Last Twelve Months):**
```python
LTM_ENABLED = True  # Set to False if all years are complete
LTM_START_MONTH = 10
LTM_START_YEAR = 2024
LTM_END_MONTH = 9
LTM_END_YEAR = 2025
```

**Exclusion Filters:**
```python
EXCLUSION_FILTERS = {
    'enabled': True,
    'exclude_by_column': 'Country',
    'exclude_values': ['Test', 'KVT']
}
```

**See `config.py` for all available options and detailed comments.**

---

## 📊 Data Requirements

### Required Columns

Your CSV file must have:
- **Revenue column:** A numeric column with sales amounts (configured as `REVENUE_COLUMN`)
- **Date column:** At least one date column (configured as `DATE_COLUMN`)

### Recommended Columns

For full analysis capabilities, include:
- **Customer/Account:** For customer segmentation and analysis
- **Item/Product:** For product analysis
- **Quantity:** For price calculations
- **Geographic:** Region, Country for geographic analysis
- **Segments:** Technology, EndMarket, ProductGroup for segmentation

### Date Column Fallback

The data loader supports fallback logic:
1. **Primary:** Uses `DATE_COLUMN` (e.g., InvoiceDate)
2. **Fallback 1:** Uses columns in `DATE_FALLBACK_COLUMNS` (e.g., Month, Year)
3. **Fallback 2:** Constructs from Year column if available

This ensures maximum date coverage even if some rows have missing dates.

---

## 💻 Creating Analysis Scripts

### Using the Template

1. **Copy the template:**
   ```bash
   cp analysis_template.py my_analysis.py
   ```

2. **Update configuration:**
   ```python
   ANALYSIS_NAME = "My Analysis"
   DESCRIPTION = "Description of what this analysis does"
   ```

3. **Implement your logic:**
   - Use `calculate_annual_metrics()` for annual aggregations
   - Use `setup_revenue_chart()` and `save_chart()` for visualizations
   - Follow patterns from `.cursor/rules/analysis_patterns.md`

4. **Run your analysis:**
   ```bash
   python my_analysis.py
   ```

### Standard Pattern

```python
from data_loader import load_sales_data, validate_data_structure
from analysis_utils import (
    get_ltm_period_config, calculate_annual_metrics,
    setup_revenue_chart, save_chart, apply_exclusion_filters
)
from config import get_data_path, REVENUE_COLUMN, CHART_SIZES

# Load and validate
df = load_sales_data(get_data_path())
is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
    return

# Apply filters
df = apply_exclusion_filters(df)

# Calculate metrics
ltm_start, ltm_end = get_ltm_period_config()
annual_df = calculate_annual_metrics(df, calculate_metrics, ltm_start, ltm_end)

# Create charts
fig, ax = plt.subplots(figsize=CHART_SIZES['medium'])
ax.plot(data / 1e6, ...)
setup_revenue_chart(ax)
save_chart(fig, 'chart.png')
```

---

## 🎯 Key Features

### 1. Flexible Data Loading

- Handles different column names via configuration
- Fallback logic for date parsing (100% coverage)
- Automatic validation and error reporting

### 2. LTM (Last Twelve Months) Support

- Automatic LTM calculation for partial years
- Apples-to-apples comparison with full calendar years
- Configurable LTM periods

### 3. Standardized Chart Formatting

- Automatic millions formatter for revenue charts
- Consistent styling and sizing
- Professional output ready for reports
- Optional interactive charts with Plotly

### 4. Exclusion Filters

- Easy configuration for excluding segments
- Useful for excluding test accounts, business units, etc.

### 5. Revenue Validation

- Automatic validation after each analysis
- Ensures data loading is working correctly
- Optional validation against expected values

### 6. Example Scripts

- Working examples for common analyses
- Demonstrates best practices
- Easy to customize and extend

### 7. Data Export

- Export results to CSV and Excel
- Formatted summary tables
- Multiple sheet support

### 8. Data Quality Reporting

- Comprehensive data quality checks
- Missing value analysis
- Outlier detection
- Data profiling

### 9. Configuration Validation

- Early error detection
- Validates column mappings
- Checks date ranges and LTM configuration

### 10. Statistical Utilities

- Year-over-year growth calculations
- CAGR (Compound Annual Growth Rate)
- Correlation analysis
- Statistical significance testing

### 11. Report Generation

- Combine multiple charts into PDF reports
- Professional formatting
- Summary tables and metadata

### 12. Logging Infrastructure

- Structured logging with file and console output
- Analysis execution tracking
- Configurable log levels

---

## 📚 Documentation

### For AI Agents (Cursor IDE)

The `.cursor/rules/` directory contains comprehensive rules that are automatically loaded by Cursor:

- **`ai_assistant_guide.md`:** Complete guide with ready-to-use prompts
- **`advanced_analysis_patterns.md`:** Advanced techniques (cohort, PVM, forecasting, etc.)
- **`analysis_patterns.md`:** Standard patterns for creating analyses
- **`data_loading.md`:** Always use `data_loader.py`, never `pd.read_csv()` directly
- **`chart_formatting.md`:** How to format charts correctly
- **`ltm_methodology.md`:** LTM implementation and usage
- **`common_errors.md`:** Troubleshooting guide
- **`code_quality.md`:** Code quality standards and Cursor best practices
- **`error_handling.md`:** How to write AI-friendly error messages

### For Developers

- **`config.py`:** Heavily commented with all configuration options
- **`analysis_template.py`:** Template with examples and comments
- **`analysis_utils.py`:** Well-documented utility functions

---

## 🔍 Common Analysis Types

This template supports all standard sales analyses:

### Revenue Analyses
- Annual revenue trends
- Monthly revenue analysis
- Revenue by segment/product/geography

### Customer Analyses
- Customer segmentation (RFM)
- Customer concentration
- Churn analysis
- Cohort analysis
- Customer lifetime value (CLV)

### Product Analyses
- Product performance
- Product lifecycle
- BCG matrix
- Market basket analysis

### Financial Analyses
- Price elasticity
- Contribution margin
- Price vs volume analysis

### Advanced Analyses
- Seasonality analysis
- Time series forecasting
- Customer churn prediction

**See `examples/` directory for working example scripts, or the original Dukane project for 24+ production analysis scripts.**

---

## 🛠️ Dependencies

Install required packages:

```bash
pip install -r requirements.txt
```

**Core dependencies:**
- `pandas` - Data manipulation
- `numpy` - Numerical operations
- `matplotlib` - Charting
- `seaborn` - Enhanced visualizations

**Optional dependencies** (uncomment in requirements.txt if needed):
- `openpyxl` - Excel export (export_utils.py)
- `plotly` - Interactive charts (analysis_utils.py)
- `reportlab` - PDF reports (report_generator.py)
- `scipy` - Statistical analysis (statistical_utils.py)
- `pytest` - Unit testing
- `pmdarima` - Time series forecasting
- `mlxtend` - Market basket analysis
- `scikit-learn` - Machine learning

---

## ⚠️ Important Notes

### Always Use Utilities

**✅ DO:**
```python
from data_loader import load_sales_data
from analysis_utils import setup_revenue_chart, save_chart
from config import REVENUE_COLUMN, CHART_SIZES
```

**❌ DON'T:**
```python
df = pd.read_csv('data.csv')  # Use data_loader instead
ax.plot(revenue, ...)  # Divide by 1e6 first, use setup_revenue_chart()
```

### Chart Formatting

**ALWAYS divide revenue by 1e6 before plotting:**
```python
ax.plot(revenue / 1e6, ...)  # Convert to millions
setup_revenue_chart(ax)  # Apply formatter
```

### LTM Labeling

**ALWAYS label LTM years correctly:**
```python
from config import get_ltm_label
ltm_label = get_ltm_label()  # Returns "2025 (LTM 9/2025)" or None
if ltm_label:
    title += f'\n({ltm_label})'
```

---

## 🐛 Troubleshooting

### Data Loading Issues

**Problem:** "Data file not found"
- **Solution:** Check `DATA_FILE` path in config.py
- **Solution:** Ensure file is in template directory or update `DATA_DIR`

**Problem:** "Required column 'USD' not found"
- **Solution:** Update `REVENUE_COLUMN` in config.py to match your CSV
- **Solution:** Check all column mappings in config.py

**Problem:** "All dates are NaN"
- **Solution:** Add fallback date columns to `DATE_FALLBACK_COLUMNS`
- **Solution:** Check date format in your CSV

### Analysis Issues

**Problem:** Charts show scientific notation (1e8)
- **Solution:** Divide by 1e6 before plotting: `ax.plot(data / 1e6, ...)`
- **Solution:** Use `setup_revenue_chart(ax)` to apply formatter

**Problem:** "DataFrame is empty" after filtering
- **Solution:** Check `MIN_YEAR` and `MAX_DATE` in config.py
- **Solution:** Verify `ANALYSIS_YEARS` includes years in your data

**See `.cursor/rules/common_errors.md` for more troubleshooting help.**

---

## 📝 Example Workflow

### Complete Analysis Workflow

1. **Setup:**
   ```bash
   python setup_wizard.py
   ```

2. **Test data loading:**
   ```bash
   python -c "from data_loader import load_sales_data; from config import get_data_path; df = load_sales_data(get_data_path()); print(f'✓ Loaded {len(df):,} rows')"
   ```

3. **Create analysis:**
   ```bash
   cp analysis_template.py revenue_analysis.py
   # Edit revenue_analysis.py
   ```

4. **Run analysis:**
   ```bash
   python revenue_analysis.py
   ```

5. **Add to batch runner:**
   ```python
   # In run_all_analyses.py:
   ANALYSIS_SCRIPTS = [
       'revenue_analysis.py',
       # ... other analyses
   ]
   ```

6. **Run all analyses:**
   ```bash
   python run_all_analyses.py
   ```

---

## 🤝 Best Practices

1. **Always validate data** after loading:
   ```python
   is_valid, msg = validate_data_structure(df)
   ```

2. **Use configuration values** instead of hardcoding:
   ```python
   from config import REVENUE_COLUMN  # ✅
   revenue = df['USD'].sum()  # ❌ Hardcoded
   ```

3. **Apply exclusion filters** if configured:
   ```python
   df = apply_exclusion_filters(df)
   ```

4. **Validate revenue** at end of each analysis:
   ```python
   validate_revenue(df, "Analysis Name")
   ```

5. **Use utility functions** for consistency:
   ```python
   from analysis_utils import calculate_annual_metrics, setup_revenue_chart
   ```

---

## 📄 License

This template is provided as-is for use in sales analysis projects.

---

## 🙏 Acknowledgments

This template is based on best practices developed during the Dukane Corporation sales analysis project, which included 24+ production-ready analysis scripts and comprehensive documentation.

---

## 📞 Support

For questions or issues:
1. Check `.cursor/rules/` for detailed patterns and troubleshooting
2. Review `config.py` comments for configuration options
3. See example analyses in the original Dukane project

---

**Last Updated:** January 2026
**Template Version:** 1.0
**Status:** Production Ready