Files
2026-01-30 03:04:10 +00:00

5.1 KiB

Cloudflare R2 SQL Skill Reference

Expert guidance for Cloudflare R2 SQL - serverless distributed query engine for Apache Iceberg tables.

Reading Order

New to R2 SQL? Start here:

  1. Read "What is R2 SQL?" and "When to Use" below
  2. configuration.md - Enable catalog, create tokens
  3. patterns.md - Wrangler CLI and integration examples
  4. api.md - SQL syntax and query reference
  5. gotchas.md - Limitations and troubleshooting

Quick reference? Jump to:

What is R2 SQL?

R2 SQL is Cloudflare's serverless distributed analytics query engine for querying Apache Iceberg tables in R2 Data Catalog. Features:

  • Serverless - No clusters to manage, no infrastructure
  • Distributed - Leverages Cloudflare's global network for parallel execution
  • SQL interface - Familiar SQL syntax for analytics queries
  • Zero egress fees - Query from any cloud/region without data transfer costs
  • Open beta - Free during beta (standard R2 storage costs apply)

What is Apache Iceberg?

Open table format for large-scale analytics datasets in object storage:

  • ACID transactions - Safe concurrent reads/writes
  • Metadata optimization - Fast queries without full table scans
  • Schema evolution - Add/rename/drop columns without rewrites
  • Partitioning - Organize data for efficient pruning

When to Use

Use R2 SQL for:

  • Log analytics - Query application/system logs with WHERE filters and aggregations
  • BI dashboards - Generate reports from large analytical datasets
  • Fraud detection - Analyze transaction patterns with GROUP BY/HAVING
  • Multi-cloud analytics - Query data from any cloud without egress fees
  • Ad-hoc exploration - Run SQL queries on Iceberg tables via Wrangler CLI

Don't use R2 SQL for:

  • Workers/Pages runtime - R2 SQL has no Workers binding, use HTTP API from external systems
  • Real-time queries (<100ms) - Optimized for analytical batch queries, not OLTP
  • Complex joins/CTEs - Limited SQL feature set (no JOINs, subqueries, CTEs currently)
  • Small datasets (<1GB) - Setup overhead not justified

Decision Tree: Need to Query R2 Data?

Do you need to query structured data in R2?
├─ YES, data is in Iceberg tables
│  ├─ Need SQL interface? → Use R2 SQL (this reference)
│  ├─ Need Python API? → See r2-data-catalog reference (PyIceberg)
│  └─ Need other engine? → See r2-data-catalog reference (Spark, Trino, etc.)
│
├─ YES, but not in Iceberg format
│  ├─ Streaming data? → Use Pipelines to write to Data Catalog, then R2 SQL
│  └─ Static files? → Use PyIceberg to create Iceberg tables, then R2 SQL
│
└─ NO, just need object storage → Use R2 reference (not R2 SQL)

Architecture Overview

Query Planner:

  • Top-down metadata investigation with multi-layer pruning
  • Partition-level, column-level, and row-group pruning
  • Streaming pipeline - execution starts before planning completes
  • Early termination with LIMIT - stops when result complete

Query Execution:

  • Coordinator distributes work to workers across Cloudflare network
  • Workers run Apache DataFusion for parallel query execution
  • Parquet column pruning - reads only required columns
  • Ranged reads from R2 for efficiency

Aggregation Strategies:

  • Scatter-gather - simple aggregations (SUM, COUNT, AVG)
  • Shuffling - ORDER BY/HAVING on aggregates via hash partitioning

Quick Start

# 1. Enable R2 Data Catalog on bucket
npx wrangler r2 bucket catalog enable my-bucket

# 2. Create API token (Admin Read & Write)
# Dashboard: R2 → Manage API tokens → Create API token

# 3. Set environment variable
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>

# 4. Run query
npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.my_table LIMIT 10"

Important Limitations

CRITICAL: No Workers Binding

  • R2 SQL cannot be called directly from Workers/Pages code
  • For programmatic access, use HTTP API from external systems
  • Or query via PyIceberg, Spark, etc. (see r2-data-catalog reference)

SQL Feature Set:

  • No JOINs, CTEs, subqueries, window functions
  • ORDER BY supports aggregation columns (not just partition keys)
  • LIMIT max 10,000 (default 500)
  • See gotchas.md for complete limitations

In This Reference

  • configuration.md - Enable catalog, create API tokens
  • api.md - SQL syntax, functions, operators, data types
  • patterns.md - Wrangler CLI, HTTP API, Pipelines, PyIceberg
  • gotchas.md - Limitations, troubleshooting, performance tips

See Also