Files
claude-skills/security-review/references/data-protection.md
2026-01-30 03:04:10 +00:00

9.3 KiB

Data Protection Reference

Overview

Data protection encompasses safeguarding sensitive information throughout its lifecycle: collection, processing, storage, transmission, and disposal. Security failures at any stage can lead to data breaches.

Sensitive Data Categories

Personal Identifiable Information (PII)

  • Full names, addresses, phone numbers
  • Email addresses
  • Social Security Numbers, national IDs
  • Dates of birth
  • Biometric data

Financial Information

  • Credit card numbers (PAN)
  • Bank account numbers
  • Financial transactions
  • Payment credentials

Authentication Credentials

  • Passwords (plaintext or weakly hashed)
  • API keys and tokens
  • Session identifiers
  • Private keys

Health Information (PHI/HIPAA)

  • Medical records
  • Health conditions
  • Treatment information
  • Insurance data

Sensitive Data Exposure Prevention

1. Data Classification

Classify all data by sensitivity level:

Level Examples Handling
Public Marketing content No restrictions
Internal Employee directory Access controls
Confidential Customer data Encryption + access controls
Restricted Passwords, keys, PCI data Strong encryption + audit logs

2. Minimize Data Collection

# VULNERABLE: Collecting unnecessary data
user_data = {
    'name': form.name,
    'email': form.email,
    'ssn': form.ssn,  # Why do you need this?
    'mother_maiden_name': form.mother_maiden_name,  # Security risk
    'password': form.password,  # Never store plaintext
}

# SAFE: Collect only what's needed
user_data = {
    'name': form.name,
    'email': form.email,
}

3. Encryption at Rest

# Database-level encryption
# Configure in database settings (TDE for SQL Server, etc.)

# Application-level encryption for specific fields
from cryptography.fernet import Fernet

def encrypt_ssn(ssn):
    f = Fernet(get_encryption_key())
    return f.encrypt(ssn.encode())

def decrypt_ssn(encrypted_ssn):
    f = Fernet(get_encryption_key())
    return f.decrypt(encrypted_ssn).decode()

4. Encryption in Transit

# VULNERABLE: HTTP endpoint
app.run(host='0.0.0.0', port=80)

# SAFE: HTTPS required
app.run(host='0.0.0.0', port=443, ssl_context='adhoc')

# BETTER: Proper TLS configuration
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ssl_context.load_cert_chain('cert.pem', 'key.pem')
ssl_context.minimum_version = ssl.TLSVersion.TLSv1_2

Information Disclosure Prevention

Error Messages

# VULNERABLE: Detailed error messages
@app.errorhandler(Exception)
def handle_error(e):
    return {
        'error': str(e),
        'traceback': traceback.format_exc(),
        'sql_query': last_query,
        'server': socket.gethostname()
    }, 500

# SAFE: Generic error messages
@app.errorhandler(Exception)
def handle_error(e):
    # Log full details server-side
    app.logger.error(f"Error: {e}", exc_info=True)

    # Return generic message to client
    return {'error': 'An unexpected error occurred'}, 500

Stack Traces

# VULNERABLE: Debug mode in production
app.run(debug=True)

# SAFE: Debug off, custom error pages
app.run(debug=False)

@app.errorhandler(404)
def not_found(e):
    return render_template('404.html'), 404

@app.errorhandler(500)
def server_error(e):
    return render_template('500.html'), 500

API Response Filtering

# VULNERABLE: Returning all fields
@app.route('/api/users/<id>')
def get_user(id):
    user = User.query.get(id)
    return jsonify(user.__dict__)  # Includes password_hash, internal_id, etc.

# SAFE: Explicit field selection
@app.route('/api/users/<id>')
def get_user(id):
    user = User.query.get(id)
    return jsonify({
        'id': user.public_id,
        'name': user.name,
        'email': user.email
    })

Server Headers

# VULNERABLE: Technology disclosure
# Response headers reveal:
# Server: Apache/2.4.41 (Ubuntu)
# X-Powered-By: PHP/7.4.3
# X-AspNet-Version: 4.0.30319

# SAFE: Remove or genericize headers
# In nginx:
# server_tokens off;

# In Express.js:
app.disable('x-powered-by');

# In Flask:
@app.after_request
def remove_headers(response):
    response.headers.pop('Server', None)
    return response

Logging Security

What NOT to Log

# VULNERABLE: Logging sensitive data
logger.info(f"User login: {username}, password: {password}")
logger.info(f"API call with key: {api_key}")
logger.info(f"Credit card: {card_number}")
logger.debug(f"Session token: {session_id}")

# SAFE: Sanitized logging
logger.info(f"User login: {username}")
logger.info(f"API call with key: {api_key[:4]}****")
logger.info(f"Credit card: ****{card_number[-4:]}")
logger.debug(f"Session token: {hash_for_logging(session_id)}")

Sensitive Data Patterns to Avoid in Logs

Data Type Pattern
Passwords password, passwd, pwd, secret
API Keys api_key, apikey, token, bearer
Credit Cards 16-digit numbers, card_number
SSN \d{3}-\d{2}-\d{4}, ssn, social
Session IDs session, sess_id, jsessionid

Log Injection Prevention

# VULNERABLE: User input directly in logs
logger.info(f"Search query: {user_input}")
# Attack: user_input = "test\nINFO: Admin logged in"

# SAFE: Sanitize before logging
def sanitize_for_log(text):
    return text.replace('\n', '\\n').replace('\r', '\\r')

logger.info(f"Search query: {sanitize_for_log(user_input)}")

Secure Data Disposal

Memory Handling

# Python strings are immutable - difficult to clear
# Use bytearray for sensitive data when possible

# BETTER: Clear sensitive data
import ctypes

def secure_zero(data):
    """Zero out sensitive data in memory."""
    if isinstance(data, bytearray):
        for i in range(len(data)):
            data[i] = 0
    elif isinstance(data, bytes):
        # Can't modify bytes, but can overwrite the reference
        pass

# In Java:
# char[] password = getPassword();
# try { ... }
# finally { Arrays.fill(password, '\0'); }

File Deletion

# VULNERABLE: Simple delete (data recoverable)
os.remove(sensitive_file)

# SAFER: Overwrite before delete
def secure_delete(filepath):
    with open(filepath, 'ba+') as f:
        length = f.tell()
        f.seek(0)
        f.write(os.urandom(length))  # Random overwrite
        f.flush()
        os.fsync(f.fileno())
    os.remove(filepath)

Database Retention

# Implement data retention policies
def cleanup_old_data():
    cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)

    # Delete old records
    OldRecord.query.filter(OldRecord.created_at < cutoff).delete()

    # Or anonymize instead of delete
    User.query.filter(User.last_login < cutoff).update({
        'email': func.concat('deleted_', User.id, '@example.com'),
        'name': 'Deleted User',
        'phone': None
    })

Cache Security

# VULNERABLE: Caching sensitive data
@cache.cached(timeout=3600)
def get_user_with_ssn(user_id):
    return User.query.get(user_id)  # Includes SSN

# SAFE: Don't cache sensitive data
def get_user_with_ssn(user_id):
    return User.query.get(user_id)  # Not cached

# Or cache only non-sensitive parts
@cache.cached(timeout=3600)
def get_user_profile(user_id):
    user = User.query.get(user_id)
    return {
        'id': user.id,
        'name': user.name,
        # SSN excluded
    }

Cache Headers

# For sensitive pages
response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
response.headers['Pragma'] = 'no-cache'
response.headers['Expires'] = '0'

Grep Patterns for Detection

# Sensitive data in logs
grep -rn "logger.*password\|log.*password\|print.*password" --include="*.py" --include="*.js"
grep -rn "logger.*token\|log.*api_key\|print.*secret" --include="*.py" --include="*.js"

# Debug mode
grep -rn "debug.*[Tt]rue\|DEBUG.*=.*1" --include="*.py" --include="*.js" --include="*.env"

# Stack traces in responses
grep -rn "traceback\|stack_trace\|exc_info" --include="*.py" | grep -i "return\|response\|json"

# Verbose errors
grep -rn "str(e)\|str(exception)" --include="*.py" | grep -i "return\|response"

# Technology disclosure
grep -rn "X-Powered-By\|Server:" --include="*.py" --include="*.js" --include="*.conf"

# Missing cache headers
grep -rn "Set-Cookie\|session" --include="*.py" | grep -v "Cache-Control"

Testing Checklist

  • Sensitive data encrypted at rest
  • All transmissions over TLS 1.2+
  • Error messages are generic (no stack traces, SQL errors, paths)
  • Logging excludes sensitive data (passwords, tokens, PII)
  • API responses filtered to necessary fields only
  • Server headers don't reveal technology stack
  • Sensitive pages have no-cache headers
  • Data retention policies implemented
  • Secure deletion procedures for sensitive files
  • Debug mode disabled in production

References