Files
2026-01-30 03:04:10 +00:00
..
2026-01-30 03:04:10 +00:00
2026-01-30 03:04:10 +00:00
2026-01-30 03:04:10 +00:00
2026-01-30 03:04:10 +00:00
2026-01-30 03:04:10 +00:00

Cloudflare R2 Data Catalog Skill Reference

Expert guidance for Cloudflare R2 Data Catalog - Apache Iceberg catalog built into R2 buckets.

Reading Order

New to R2 Data Catalog? Start here:

  1. Read "What is R2 Data Catalog?" and "When to Use" below
  2. configuration.md - Enable catalog, create tokens
  3. patterns.md - PyIceberg setup and common patterns
  4. api.md - REST API reference as needed
  5. gotchas.md - Troubleshooting when issues arise

Quick reference? Jump to:

What is R2 Data Catalog?

R2 Data Catalog is a managed Apache Iceberg REST catalog built directly into R2 buckets. It provides:

  • Apache Iceberg tables - ACID transactions, schema evolution, time-travel queries
  • Zero-egress costs - Query from any cloud/region without data transfer fees
  • Standard REST API - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
  • No infrastructure - Fully managed, no catalog servers to run
  • Public beta - Available to all R2 subscribers, no extra cost beyond R2 storage

What is Apache Iceberg?

Open table format for analytics datasets in object storage. Features:

  • ACID transactions - Safe concurrent reads/writes
  • Metadata optimization - Fast queries without full scans
  • Schema evolution - Add/rename/delete columns without rewrites
  • Time-travel - Query historical snapshots
  • Partitioning - Organize data for efficient queries

When to Use

Use R2 Data Catalog for:

  • Log analytics - Store and query application/system logs
  • Data lakes/warehouses - Analytical datasets queried by multiple engines
  • BI pipelines - Aggregate data for dashboards and reports
  • Multi-cloud analytics - Share data across clouds without egress fees
  • Time-series data - Event streams, metrics, sensor data

Don't use for:

  • Transactional workloads - Use D1 or external database instead
  • Sub-second latency - Iceberg optimized for batch/analytical queries
  • Small datasets (<1GB) - Setup overhead not worth it
  • Unstructured data - Store files directly in R2, not as Iceberg tables

Architecture

┌─────────────────────────────────────────────────┐
│  Query Engines                                  │
│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)  │
└────────────────┬────────────────────────────────┘
                 │
                 │ REST API (OAuth2 token)
                 ▼
┌─────────────────────────────────────────────────┐
│  R2 Data Catalog (Managed Iceberg REST Catalog)│
│  • Namespace/table metadata                     │
│  • Transaction coordination                     │
│  • Snapshot management                          │
└────────────────┬────────────────────────────────┘
                 │
                 │ Vended credentials
                 ▼
┌─────────────────────────────────────────────────┐
│  R2 Bucket Storage                              │
│  • Parquet data files                           │
│  • Metadata files                               │
│  • Manifest files                               │
└─────────────────────────────────────────────────┘

Key concepts:

  • Catalog URI - REST endpoint for catalog operations (e.g., https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>)
  • Warehouse - Logical grouping of tables (typically same as bucket name)
  • Namespace - Schema/database containing tables (e.g., logs, analytics)
  • Table - Iceberg table with schema, data files, snapshots
  • Vended credentials - Temporary S3 credentials catalog provides for data access

Limits

Resource Limit Notes
Namespaces per catalog No hard limit Organize tables logically
Tables per namespace <10,000 recommended Performance degrades beyond this
Files per table <100,000 recommended Run compaction regularly
Snapshots per table Configurable retention Expire >7 days old
Partitions per table 100-1,000 optimal Too many = slow metadata ops
Table size Same as R2 bucket 10GB-10TB+ common
API rate limits Standard R2 API limits Shared with R2 storage operations
Target file size 128-512 MB After compaction

Current Status

Public Beta (as of Jan 2026)

  • Available to all R2 subscribers
  • No extra cost beyond standard R2 storage/operations
  • Production-ready, but breaking changes possible
  • Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance

Decision Tree: Is R2 Data Catalog Right For You?

Start → Need analytics on object storage data?
         │
         ├─ No → Use R2 directly for object storage
         │
         └─ Yes → Dataset >1GB with structured schema?
                  │
                  ├─ No → Too small, use R2 + ad-hoc queries
                  │
                  └─ Yes → Need ACID transactions or schema evolution?
                           │
                           ├─ No → Consider simpler solutions (Parquet on R2)
                           │
                           └─ Yes → Need multi-cloud/multi-tool access?
                                    │
                                    ├─ No → D1 or external DB may be simpler
                                    │
                                    └─ Yes → ✅ Use R2 Data Catalog

Quick check: If you answer "yes" to all:

  • Dataset >1GB and growing
  • Structured/tabular data (logs, events, metrics)
  • Multiple query tools or cloud environments
  • Need versioning, schema changes, or concurrent access

→ R2 Data Catalog is a good fit.

In This Reference

  • configuration.md - Enable catalog, create API tokens, connect clients
  • api.md - REST endpoints, operations, maintenance
  • patterns.md - PyIceberg examples, common use cases
  • gotchas.md - Troubleshooting, best practices, limitations

See Also