Cloudflare R2 Data Catalog Skill Reference
Expert guidance for Cloudflare R2 Data Catalog - Apache Iceberg catalog built into R2 buckets.
Reading Order
New to R2 Data Catalog? Start here:
- Read "What is R2 Data Catalog?" and "When to Use" below
- configuration.md - Enable catalog, create tokens
- patterns.md - PyIceberg setup and common patterns
- api.md - REST API reference as needed
- gotchas.md - Troubleshooting when issues arise
Quick reference? Jump straight to the file list under "In This Reference" at the end of this document.
What is R2 Data Catalog?
R2 Data Catalog is a managed Apache Iceberg REST catalog built directly into R2 buckets. It provides:
- Apache Iceberg tables - ACID transactions, schema evolution, time-travel queries
- Zero-egress costs - Query from any cloud/region without data transfer fees
- Standard REST API - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- No infrastructure - Fully managed, no catalog servers to run
- Public beta - Available to all R2 subscribers, no extra cost beyond R2 storage
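Because the catalog speaks the standard Iceberg REST protocol, connecting from PyIceberg is a few lines of configuration. The sketch below is illustrative: `account_id`, `bucket`, and `token` are placeholders you must supply, and the URI shape follows the example given under "Key concepts" later in this document.

```python
# Minimal sketch: connect PyIceberg to R2 Data Catalog over its REST API.
# The account ID, bucket name, and token are placeholders (assumptions).

def r2_catalog_config(account_id: str, bucket: str, token: str) -> dict:
    """Build PyIceberg REST catalog properties for an R2 bucket."""
    return {
        "type": "rest",
        # Catalog URI shape used elsewhere in this reference:
        "uri": f"https://{account_id}.r2.cloudflarestorage.com/iceberg/{bucket}",
        "warehouse": bucket,   # warehouse typically matches the bucket name
        "token": token,        # R2 API token with Data Catalog permissions
    }

def connect(account_id: str, bucket: str, token: str):
    # Imported here so the config helper above stays dependency-free.
    from pyiceberg.catalog import load_catalog  # pip install pyiceberg
    return load_catalog("r2", **r2_catalog_config(account_id, bucket, token))
```

Once connected, the catalog behaves like any Iceberg REST catalog: `catalog.list_namespaces()`, `catalog.load_table(...)`, and so on.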
What is Apache Iceberg?
Open table format for analytics datasets in object storage. Features:
- ACID transactions - Safe concurrent reads/writes
- Metadata optimization - Fast queries without full scans
- Schema evolution - Add/rename/delete columns without rewrites
- Time-travel - Query historical snapshots
- Partitioning - Organize data for efficient queries
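Time-travel, for example, works by resolving a requested timestamp to a table snapshot: the engine picks the most recent snapshot committed at or before that time. A small dependency-free sketch of that lookup (the snapshot history below is made up):

```python
# How "as of <timestamp>" time-travel resolves: choose the most recent
# snapshot committed at or before the requested time. In practice, engines
# read the (snapshot_id, committed_at) pairs from the table's metadata.

def snapshot_as_of(snapshots, ts_ms):
    """snapshots: iterable of (snapshot_id, committed_at_ms). Returns a
    snapshot_id, or None if the table did not exist yet at ts_ms."""
    eligible = [s for s in snapshots if s[1] <= ts_ms]
    return max(eligible, key=lambda s: s[1])[0] if eligible else None

history = [(101, 1_000), (102, 2_000), (103, 3_000)]  # illustrative history
assert snapshot_as_of(history, 2_500) == 102
assert snapshot_as_of(history, 500) is None   # table didn't exist yet
```

With PyIceberg, the resolved snapshot can then be read via `table.scan(snapshot_id=...)`.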
When to Use
Use R2 Data Catalog for:
- Log analytics - Store and query application/system logs
- Data lakes/warehouses - Analytical datasets queried by multiple engines
- BI pipelines - Aggregate data for dashboards and reports
- Multi-cloud analytics - Share data across clouds without egress fees
- Time-series data - Event streams, metrics, sensor data
Don't use for:
- Transactional workloads - Use D1 or an external database instead
- Sub-second latency - Iceberg optimized for batch/analytical queries
- Small datasets (<1GB) - Setup overhead not worth it
- Unstructured data - Store files directly in R2, not as Iceberg tables
Architecture
┌─────────────────────────────────────────────────┐
│ Query Engines │
│ (PyIceberg, Spark, Trino, Snowflake, DuckDB) │
└────────────────┬────────────────────────────────┘
│
│ REST API (OAuth2 token)
▼
┌─────────────────────────────────────────────────┐
│ R2 Data Catalog (Managed Iceberg REST Catalog)│
│ • Namespace/table metadata │
│ • Transaction coordination │
│ • Snapshot management │
└────────────────┬────────────────────────────────┘
│
│ Vended credentials
▼
┌─────────────────────────────────────────────────┐
│ R2 Bucket Storage │
│ • Parquet data files │
│ • Metadata files │
│ • Manifest files │
└─────────────────────────────────────────────────┘
Key concepts:
- Catalog URI - REST endpoint for catalog operations (e.g., https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>)
- Warehouse - Logical grouping of tables (typically same as bucket name)
- Namespace - Schema/database containing tables (e.g., logs, analytics)
- Table - Iceberg table with schema, data files, snapshots
- Vended credentials - Temporary S3 credentials the catalog provides for data access
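A sketch of how these concepts fit together: create a namespace, then a table inside it, through an already-connected PyIceberg catalog. The namespace, table, and column names are illustrative assumptions, not part of the product.

```python
# Illustrative only: namespace -> table creation through an Iceberg catalog.
# Assumes `catalog` was obtained via pyiceberg's load_catalog(...).

TABLE_IDENTIFIER = ("logs", "http_requests")  # (namespace, table) - made-up names

def create_logs_table(catalog):
    import pyarrow as pa  # pip install pyarrow

    catalog.create_namespace(TABLE_IDENTIFIER[0])  # the schema/database level
    schema = pa.schema([
        ("ts", pa.timestamp("us")),
        ("status", pa.int32()),
        ("path", pa.string()),
    ])
    # Data, metadata, and manifest files land in the R2 bucket via vended
    # credentials; no S3 keys are configured on the client.
    return catalog.create_table(TABLE_IDENTIFIER, schema=schema)
```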
Limits
| Resource | Limit | Notes |
|---|---|---|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire snapshots older than ~7 days |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |
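The file-count and target-file-size rows above interact: file count is roughly table size divided by target file size, rounded up. A quick back-of-envelope check (the sizes below are examples, not limits):

```python
# Rough check of the "files per table" guidance: estimate how many data
# files a table of a given size produces at a given target file size.

def estimated_file_count(table_bytes: int, target_file_bytes: int) -> int:
    # Ceiling division: a partially filled last file still counts.
    return -(-table_bytes // target_file_bytes)

MiB, TiB = 1024**2, 1024**4

# A 10 TiB table compacted to 256 MiB files -> 40,960 files, comfortably
# under the ~100,000-file guidance; left at 8 MiB files it would balloon
# to 1,310,720 - which is why regular compaction matters.
assert estimated_file_count(10 * TiB, 256 * MiB) == 40_960
assert estimated_file_count(10 * TiB, 8 * MiB) == 1_310_720
```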
Current Status
Public Beta (as of Jan 2026)
- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Production-ready, but breaking changes possible
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance
Decision Tree: Is R2 Data Catalog Right For You?
Start → Need analytics on object storage data?
│
├─ No → Use R2 directly for object storage
│
└─ Yes → Dataset >1GB with structured schema?
│
├─ No → Too small, use R2 + ad-hoc queries
│
└─ Yes → Need ACID transactions or schema evolution?
│
├─ No → Consider simpler solutions (Parquet on R2)
│
└─ Yes → Need multi-cloud/multi-tool access?
│
├─ No → D1 or external DB may be simpler
│
└─ Yes → ✅ Use R2 Data Catalog
Quick check: If you answer "yes" to all:
- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access
→ R2 Data Catalog is a good fit.
In This Reference
- configuration.md - Enable catalog, create API tokens, connect clients
- api.md - REST endpoints, operations, maintenance
- patterns.md - PyIceberg examples, common use cases
- gotchas.md - Troubleshooting, best practices, limitations