claude-skills/security-review/references/data-protection.md

# Data Protection Reference

## Overview

Data protection encompasses safeguarding sensitive information throughout its lifecycle: collection, processing, storage, transmission, and disposal. Security failures at any stage can lead to data breaches.

## Sensitive Data Categories

### Personal Identifiable Information (PII)
- Full names, addresses, phone numbers
- Email addresses
- Social Security Numbers, national IDs
- Dates of birth
- Biometric data

### Financial Information
- Credit card numbers (PAN)
- Bank account numbers
- Financial transactions
- Payment credentials

### Authentication Credentials
- Passwords (plaintext or weakly hashed)
- API keys and tokens
- Session identifiers
- Private keys

### Health Information (PHI/HIPAA)
- Medical records
- Health conditions
- Treatment information
- Insurance data

---

## Sensitive Data Exposure Prevention

### 1. Data Classification

Classify all data by sensitivity level:

| Level | Examples | Handling |
|-------|----------|----------|
| **Public** | Marketing content | No restrictions |
| **Internal** | Employee directory | Access controls |
| **Confidential** | Customer data | Encryption + access controls |
| **Restricted** | Passwords, keys, PCI data | Strong encryption + audit logs |

### 2. Minimize Data Collection

```python
# VULNERABLE: Collecting unnecessary data
user_data = {
    'name': form.name,
    'email': form.email,
    'ssn': form.ssn,  # Why do you need this?
    'mother_maiden_name': form.mother_maiden_name,  # Security risk
    'password': form.password,  # Never store plaintext
}

# SAFE: Collect only what's needed
user_data = {
    'name': form.name,
    'email': form.email,
}
```

### 3. Encryption at Rest

```python
# Database-level encryption
# Configure in database settings (TDE for SQL Server, etc.)

# Application-level encryption for specific fields
from cryptography.fernet import Fernet

def encrypt_ssn(ssn):
    f = Fernet(get_encryption_key())
    return f.encrypt(ssn.encode())

def decrypt_ssn(encrypted_ssn):
    f = Fernet(get_encryption_key())
    return f.decrypt(encrypted_ssn).decode()
```

### 4. Encryption in Transit

```python
# VULNERABLE: HTTP endpoint
app.run(host='0.0.0.0', port=80)

# SAFE: HTTPS required
app.run(host='0.0.0.0', port=443, ssl_context='adhoc')

# BETTER: Proper TLS configuration
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ssl_context.load_cert_chain('cert.pem', 'key.pem')
ssl_context.minimum_version = ssl.TLSVersion.TLSv1_2
```

---

## Information Disclosure Prevention

### Error Messages

```python
# VULNERABLE: Detailed error messages
@app.errorhandler(Exception)
def handle_error(e):
    return {
        'error': str(e),
        'traceback': traceback.format_exc(),
        'sql_query': last_query,
        'server': socket.gethostname()
    }, 500

# SAFE: Generic error messages
@app.errorhandler(Exception)
def handle_error(e):
    # Log full details server-side
    app.logger.error(f"Error: {e}", exc_info=True)

    # Return generic message to client
    return {'error': 'An unexpected error occurred'}, 500
```

### Stack Traces

```python
# VULNERABLE: Debug mode in production
app.run(debug=True)

# SAFE: Debug off, custom error pages
app.run(debug=False)

@app.errorhandler(404)
def not_found(e):
    return render_template('404.html'), 404

@app.errorhandler(500)
def server_error(e):
    return render_template('500.html'), 500
```

### API Response Filtering

```python
# VULNERABLE: Returning all fields
@app.route('/api/users/<id>')
def get_user(id):
    user = User.query.get(id)
    return jsonify(user.__dict__)  # Includes password_hash, internal_id, etc.

# SAFE: Explicit field selection
@app.route('/api/users/<id>')
def get_user(id):
    user = User.query.get(id)
    return jsonify({
        'id': user.public_id,
        'name': user.name,
        'email': user.email
    })
```

### Server Headers

```python
# VULNERABLE: Technology disclosure
# Response headers reveal:
# Server: Apache/2.4.41 (Ubuntu)
# X-Powered-By: PHP/7.4.3
# X-AspNet-Version: 4.0.30319

# SAFE: Remove or genericize headers
# In nginx:
# server_tokens off;

# In Express.js:
app.disable('x-powered-by');

# In Flask:
@app.after_request
def remove_headers(response):
    response.headers.pop('Server', None)
    return response
```

---

## Logging Security

### What NOT to Log

```python
# VULNERABLE: Logging sensitive data
logger.info(f"User login: {username}, password: {password}")
logger.info(f"API call with key: {api_key}")
logger.info(f"Credit card: {card_number}")
logger.debug(f"Session token: {session_id}")

# SAFE: Sanitized logging
logger.info(f"User login: {username}")
logger.info(f"API call with key: {api_key[:4]}****")
logger.info(f"Credit card: ****{card_number[-4:]}")
logger.debug(f"Session token: {hash_for_logging(session_id)}")
```

### Sensitive Data Patterns to Avoid in Logs

| Data Type | Pattern |
|-----------|---------|
| Passwords | `password`, `passwd`, `pwd`, `secret` |
| API Keys | `api_key`, `apikey`, `token`, `bearer` |
| Credit Cards | 16-digit numbers, `card_number` |
| SSN | `\d{3}-\d{2}-\d{4}`, `ssn`, `social` |
| Session IDs | `session`, `sess_id`, `jsessionid` |

### Log Injection Prevention

```python
# VULNERABLE: User input directly in logs
logger.info(f"Search query: {user_input}")
# Attack: user_input = "test\nINFO: Admin logged in"

# SAFE: Sanitize before logging
def sanitize_for_log(text):
    return text.replace('\n', '\\n').replace('\r', '\\r')

logger.info(f"Search query: {sanitize_for_log(user_input)}")
```

---

## Secure Data Disposal

### Memory Handling

```python
# Python strings are immutable - difficult to clear
# Use bytearray for sensitive data when possible

# BETTER: Clear sensitive data
import ctypes

def secure_zero(data):
    """Zero out sensitive data in memory."""
    if isinstance(data, bytearray):
        for i in range(len(data)):
            data[i] = 0
    elif isinstance(data, bytes):
        # Can't modify bytes, but can overwrite the reference
        pass

# In Java:
# char[] password = getPassword();
# try { ... }
# finally { Arrays.fill(password, '\0'); }
```

### File Deletion

```python
# VULNERABLE: Simple delete (data recoverable)
os.remove(sensitive_file)

# SAFER: Overwrite before delete
def secure_delete(filepath):
    with open(filepath, 'ba+') as f:
        length = f.tell()
        f.seek(0)
        f.write(os.urandom(length))  # Random overwrite
        f.flush()
        os.fsync(f.fileno())
    os.remove(filepath)
```

### Database Retention

```python
# Implement data retention policies
def cleanup_old_data():
    cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)

    # Delete old records
    OldRecord.query.filter(OldRecord.created_at < cutoff).delete()

    # Or anonymize instead of delete
    User.query.filter(User.last_login < cutoff).update({
        'email': func.concat('deleted_', User.id, '@example.com'),
        'name': 'Deleted User',
        'phone': None
    })
```

---

## Cache Security

```python
# VULNERABLE: Caching sensitive data
@cache.cached(timeout=3600)
def get_user_with_ssn(user_id):
    return User.query.get(user_id)  # Includes SSN

# SAFE: Don't cache sensitive data
def get_user_with_ssn(user_id):
    return User.query.get(user_id)  # Not cached

# Or cache only non-sensitive parts
@cache.cached(timeout=3600)
def get_user_profile(user_id):
    user = User.query.get(user_id)
    return {
        'id': user.id,
        'name': user.name,
        # SSN excluded
    }
```

### Cache Headers

```python
# For sensitive pages
response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
response.headers['Pragma'] = 'no-cache'
response.headers['Expires'] = '0'
```

---

## Grep Patterns for Detection

```bash
# Sensitive data in logs
grep -rn "logger.*password\|log.*password\|print.*password" --include="*.py" --include="*.js"
grep -rn "logger.*token\|log.*api_key\|print.*secret" --include="*.py" --include="*.js"

# Debug mode
grep -rn "debug.*[Tt]rue\|DEBUG.*=.*1" --include="*.py" --include="*.js" --include="*.env"

# Stack traces in responses
grep -rn "traceback\|stack_trace\|exc_info" --include="*.py" | grep -i "return\|response\|json"

# Verbose errors
grep -rn "str(e)\|str(exception)" --include="*.py" | grep -i "return\|response"

# Technology disclosure
grep -rn "X-Powered-By\|Server:" --include="*.py" --include="*.js" --include="*.conf"

# Missing cache headers
grep -rn "Set-Cookie\|session" --include="*.py" | grep -v "Cache-Control"
```

---

## Testing Checklist

- [ ] Sensitive data encrypted at rest
- [ ] All transmissions over TLS 1.2+
- [ ] Error messages are generic (no stack traces, SQL errors, paths)
- [ ] Logging excludes sensitive data (passwords, tokens, PII)
- [ ] API responses filtered to necessary fields only
- [ ] Server headers don't reveal technology stack
- [ ] Sensitive pages have no-cache headers
- [ ] Data retention policies implemented
- [ ] Secure deletion procedures for sensitive files
- [ ] Debug mode disabled in production

---

## References

- [OWASP Logging Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html)
- [OWASP Error Handling Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Error_Handling_Cheat_Sheet.html)
- [CWE-200: Information Exposure](https://cwe.mitre.org/data/definitions/200.html)
- [CWE-532: Information Exposure Through Log Files](https://cwe.mitre.org/data/definitions/532.html)
- [CWE-209: Error Message Information Leak](https://cwe.mitre.org/data/definitions/209.html)