Cloudflare Workers AI
Expert guidance for Cloudflare Workers AI - serverless GPU-powered AI inference at the edge.
Overview
Workers AI provides:
- 50+ pre-trained models (LLMs, embeddings, image generation, speech-to-text, translation)
- Native Workers binding (no external API calls)
- Pay-per-use pricing (neurons consumed per inference)
- OpenAI-compatible REST API
- Streaming support for text generation
- Function calling with compatible models
Architecture: Inference runs on Cloudflare's GPU network. Models load on first request (cold start 1-3s), subsequent requests are faster.
Quick Start
interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env) {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'What is Cloudflare?' }]
});
return Response.json(response);
}
};
# Setup - add binding to wrangler.jsonc
wrangler dev --remote # Must use --remote for AI
wrangler deploy
Model Selection Decision Tree
Text Generation (Chat/Completion)
Quality Priority:
- Best quality:
@cf/meta/llama-3.1-70b-instruct(expensive, ~2000 neurons) - Balanced:
@cf/meta/llama-3.1-8b-instruct(good quality, ~200 neurons) - Fastest/cheapest:
@cf/mistral/mistral-7b-instruct-v0.1(~50 neurons)
Function Calling:
- Use
@cf/meta/llama-3.1-8b-instructor@cf/meta/llama-3.1-70b-instruct(native tool support)
Code Generation:
- Use
@cf/deepseek-ai/deepseek-coder-6.7b-instruct(specialized for code)
Embeddings (Semantic Search/RAG)
English text:
- Best:
@cf/baai/bge-large-en-v1.5(1024 dims, highest quality) - Balanced:
@cf/baai/bge-base-en-v1.5(768 dims, good quality) - Fast:
@cf/baai/bge-small-en-v1.5(384 dims, lower quality but fast)
Multilingual:
- Use
@hf/sentence-transformers/paraphrase-multilingual-minilm-l12-v2
Image Generation
- Stable Diffusion:
@cf/stabilityai/stable-diffusion-xl-base-1.0(~10,000 neurons) - Portraits:
@cf/lykon/dreamshaper-8-lcm(optimized for faces)
Other Tasks
- Speech-to-text:
@cf/openai/whisper - Translation:
@cf/meta/m2m100-1.2b(100 languages) - Image classification:
@cf/microsoft/resnet-50
SDK Approach Decision Tree
Native Binding (Recommended)
When: Building Workers/Pages with TypeScript
Why: Zero external dependencies, best performance, native types
await env.AI.run(model, input);
REST API
When: External services, non-Workers environments, testing
Why: Standard HTTP, works anywhere
curl https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/run/@cf/meta/llama-3.1-8b-instruct \
-H "Authorization: Bearer <API_TOKEN>" \
-d '{"messages":[{"role":"user","content":"Hello"}]}'
Vercel AI SDK Integration
When: Using Vercel AI SDK features (streaming UI, tool calling abstractions)
Why: Unified interface across providers
import { openai } from '@ai-sdk/openai';
const model = openai('model-name', {
baseURL: 'https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/v1',
headers: { Authorization: 'Bearer <API_TOKEN>' }
});
RAG vs Direct Generation
Use RAG (Vectorize + Workers AI) When:
- Answering questions about specific documents/data
- Need factual accuracy from known corpus
- Context exceeds model's window (>4K tokens)
- Building knowledge base chat
Use Direct Generation When:
- Creative writing, brainstorming
- General knowledge questions
- Small context fits in prompt (<4K tokens)
- Cost optimization (RAG adds embedding + vector search costs)
Platform Limits
| Limit | Free Tier | Paid Plans |
|---|---|---|
| Neurons/day | 10,000 | Pay per use |
| Rate limit | Varies by model | Higher (contact support) |
| Context window | Model dependent (2K-8K) | Same |
| Streaming | ✅ Supported | ✅ Supported |
| Function calling | ✅ Supported (select models) | ✅ Supported |
Pricing: Free 10K neurons/day, then pay per neuron consumed (varies by model)
Common Tasks
// Streaming text generation
const stream = await env.AI.run(model, { messages, stream: true });
for await (const chunk of stream) {
console.log(chunk.response);
}
// Embeddings for RAG
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Query text', 'Document 1', 'Document 2']
});
// Function calling
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'What is the weather?' }],
tools: [{
type: 'function',
function: { name: 'getWeather', parameters: { ... } }
}]
});
Development Workflow
# Always use --remote for AI (local doesn't have models)
wrangler dev --remote
# Deploy to production
wrangler deploy
# View model catalog
# https://developers.cloudflare.com/workers-ai/models/
Reading Order
Start here: Quick Start above → configuration.md (setup)
Common tasks:
- First time setup: configuration.md → Add binding + deploy
- Choose model: Model Selection Decision Tree (above) → api.md
- Build RAG: patterns.md → Vectorize integration
- Optimize costs: Model Selection + gotchas.md (rate limits)
- Debugging: gotchas.md → Common errors
In This Reference
- configuration.md - wrangler.jsonc setup, TypeScript types, bindings, environment variables
- api.md - env.AI.run(), streaming, function calling, REST API, response types
- patterns.md - RAG with Vectorize, prompt engineering, batching, error handling, caching
- gotchas.md - Deprecated @cloudflare/ai package, rate limits, pricing, common errors
See Also
- vectorize - Vector database for RAG patterns
- ai-gateway - Caching, rate limiting, analytics for AI requests
- workers - Worker runtime and fetch handler patterns