LLM Rate Limiting & DoS Protection for AI Gateways
Ollama and LocalAI have no built-in rate limiting, and LiteLLM enforces none out of the box. A single unconstrained request can tie up a GPU for minutes and run up dollars in inference cost. Six protection layers, with ready-to-use configurations.
6 Rate Limiting Layers
Prevents brute-force prompt injection attempts and API abuse (a minimal token-bucket sketch follows these six layers)
limit: 60 req/min per user_id
burst: 10
algorithm: token_bucket
Controls LLM inference cost. Prevents one user from exhausting GPU capacity.
limit: 100,000 tokens/hour per user
daily_budget: 500,000 tokens
cost_cap: $5.00/day per user
Prevents resource exhaustion when users run many parallel agents
max_concurrent: 5 per user
max_concurrent_global: 100
queue_timeout: 30s
Prevents prompt stuffing attacks and runaway context costs
max_input_tokens: 32000
max_output_tokens: 4096
context_window_alert: 80%
Prevents infinite loops and runaway agent execution
max_tool_calls: 20 per run
max_recursion_depth: 5
timeout: 120s per run
Restricts GPT-4/Claude access to authorized users only
tier_free: [llama3.1-8b]
tier_pro: [llama3.1-70b, mistral-large]
tier_enterprise: [all]
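As a concrete illustration of the first layer, here is a minimal per-user token-bucket sketch in Python. It is a simplified in-memory version (a production gateway would typically back this with Redis); the Bucket class, allow_request helper, and refill math are illustrative assumptions, not Moltbot's actual implementation.

# token_bucket.py: illustrative per-user token bucket (Layer 1)
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    capacity: float = 10        # burst: 10
    refill_rate: float = 1.0    # 60 req/min -> 1 request token per second
    tokens: float = 10
    last_refill: float = field(default_factory=time.monotonic)

buckets: dict[str, Bucket] = {}   # keyed by user_id

def allow_request(user_id: str) -> bool:
    """Return True if this user may send another request right now."""
    bucket = buckets.setdefault(user_id, Bucket())
    now = time.monotonic()
    # Refill the tokens earned since the last check, capped at the burst size.
    bucket.tokens = min(bucket.capacity,
                        bucket.tokens + (now - bucket.last_refill) * bucket.refill_rate)
    bucket.last_refill = now
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    return False   # caller should respond with HTTP 429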
Nginx + Moltbot Gateway Configuration
# nginx.conf — Layer 1: Network-level rate limiting
http {
    limit_req_zone $binary_remote_addr zone=llm_api:10m rate=60r/m;
    limit_req_zone $binary_remote_addr zone=llm_heavy:10m rate=5r/m;

    server {
        location /v1/chat/completions {
            limit_req zone=llm_api burst=10 nodelay;
            proxy_pass http://moltbot:8080;
            proxy_read_timeout 300s;  # Long timeout for LLM generation
        }
    }
}
# moltbot.rate-limits.yaml — Layers 2-6: Token-aware limiting
rate_limits:
  per_user:
    requests_per_minute: 60
    tokens_per_hour: 100000
    tokens_per_day: 500000
    max_concurrent: 5
    max_input_tokens: 32000
    max_output_tokens: 4096
    max_tool_calls_per_run: 20
    run_timeout_seconds: 120
  model_tiers:
    free: ["llama3.1:8b", "mistral:7b"]
    pro: ["llama3.1:70b", "mixtral:8x7b"]
    enterprise: ["*"]
  cost_caps:
    currency: USD
    per_user_daily: 5.00
    global_hourly: 50.00
    alert_threshold: 0.80   # Alert at 80% of cap
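To show how a gateway might apply the model_tiers section of this file, here is a small sketch using PyYAML and glob-style matching. The model_allowed helper and the file path are hypothetical, for illustration only, and are not part of Moltbot.

# model tier check: illustrative sketch of applying the model_tiers config above
import fnmatch
import yaml   # pip install pyyaml

with open("moltbot.rate-limits.yaml") as f:
    config = yaml.safe_load(f)

TIERS = config["rate_limits"]["model_tiers"]   # {"free": [...], "pro": [...], "enterprise": ["*"]}

def model_allowed(user_tier: str, model: str) -> bool:
    """True if the user's tier grants access to the requested model ('*' matches any)."""
    patterns = TIERS.get(user_tier, [])
    return any(fnmatch.fnmatch(model, pattern) for pattern in patterns)

# model_allowed("free", "llama3.1:8b")   -> True
# model_allowed("free", "llama3.1:70b")  -> False
# model_allowed("enterprise", "gpt-4o")  -> True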
Frequently Asked Questions
Why is LLM rate limiting harder than traditional API rate limiting?
Traditional API rate limiting counts requests. LLM rate limiting must count tokens, because a single request can consume anywhere from 10 to 128,000 tokens. A user making 10 requests/minute at 10,000 tokens each consumes 100,000 tokens/minute, ten times more than a user making 100 requests/minute at 100 tokens each, while looking ten times cheaper to a request counter. Request-based limits miss this completely. Additionally, LLM inference is GPU-bound and expensive: a single unconstrained request can take minutes and cost dollars, so rate limiting must protect both server capacity and cost budget simultaneously.
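A tiny worked example of the same point, with illustrative numbers and limits (nothing here is Moltbot-specific): the request-count check waves the expensive user through, while a token-count check catches them.

# request-count vs token-count check: illustrative only
REQUESTS_PER_MINUTE_LIMIT = 60
TOKENS_PER_MINUTE_LIMIT = 20_000

heavy_user = {"requests_last_minute": 10, "tokens_last_minute": 10 * 10_000}   # 100,000 tokens
chatty_user = {"requests_last_minute": 100, "tokens_last_minute": 100 * 100}   # 10,000 tokens

for name, usage in [("heavy", heavy_user), ("chatty", chatty_user)]:
    ok_by_requests = usage["requests_last_minute"] <= REQUESTS_PER_MINUTE_LIMIT
    ok_by_tokens = usage["tokens_last_minute"] <= TOKENS_PER_MINUTE_LIMIT
    print(name, "passes request limit:", ok_by_requests, "| passes token limit:", ok_by_tokens)

# heavy  passes request limit: True  | passes token limit: False
# chatty passes request limit: False | passes token limit: True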
How does token budget enforcement work in practice?
Moltbot enforces token budgets in three phases. 1) Pre-request: estimate the token count from the input (exact for already-tokenized input, approximate for natural language); if the estimate exceeds the remaining budget, reject before sending anything to the LLM. 2) During generation: streaming responses count output tokens in real time; if the output budget is exceeded, generation stops and a partial response is returned. 3) Post-request: the actual token counts reported by the LLM API are recorded and deducted from the budget. Budgets persist in Redis with a sliding window (per hour) and a fixed window (per day). Budget refresh is configurable: hourly reset, daily reset, or monthly allocation.
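A simplified sketch of the pre-request check and post-request deduction, assuming redis-py and a fixed daily window only (the hourly sliding window and the mid-stream cutoff are omitted). The key names and the rough 4-characters-per-token estimate are illustrative assumptions, not Moltbot's actual schema.

# daily token budget in Redis: simplified sketch, not Moltbot's actual implementation
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
DAILY_BUDGET = 500_000   # tokens_per_day

def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(prompt) // 4)

def check_budget(user_id: str, prompt: str) -> bool:
    """Reject before calling the LLM if the estimate would exceed the remaining budget."""
    used = int(r.get(f"budget:daily:{user_id}") or 0)
    return used + estimate_tokens(prompt) <= DAILY_BUDGET

def record_usage(user_id: str, actual_tokens: int) -> None:
    """Deduct the real token count reported by the LLM API after the request."""
    key = f"budget:daily:{user_id}"
    r.incrby(key, actual_tokens)
    if r.ttl(key) < 0:          # first write in this window: start the 24h clock
        r.expire(key, 86_400)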
How do I protect a self-hosted Ollama instance from DoS?
Ollama has no built-in rate limiting or authentication. To protect a self-hosted Ollama instance: 1) Never expose Ollama directly (port 11434) to the internet. 2) Put Moltbot or LiteLLM in front of Ollama as a gateway, so all requests flow through it. 3) Configure rate limits on the gateway, not on Ollama. 4) Add nginx rate limiting as an additional layer: limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/m. 5) Use Tailscale or WireGuard so that only authorized clients can reach Ollama over the network.
What is the difference between rate limiting and throttling for LLMs?
Rate limiting hard-rejects requests that exceed the limit (HTTP 429 Too Many Requests). Throttling queues requests and delays them until capacity is available. For LLMs, use rate limiting for per-user token budgets (a hard stop prevents cost overrun) and throttling for global GPU capacity management (queueing rather than rejecting gives better UX when the server is busy but not being abused). Moltbot supports both: per-user rate limiting (hard) plus a global queue with a maximum depth and timeout (soft throttling).
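A minimal sketch of the soft-throttling side, assuming an asyncio-based gateway: a global semaphore caps concurrent inference, waiting requests queue up to a timeout, and only after that does the gateway fall back to a hard rejection. The names, numbers, and the call_llm_backend stub are illustrative, not Moltbot's API.

# global soft throttling with a bounded wait: illustrative sketch
import asyncio

MAX_CONCURRENT_GLOBAL = 100   # GPU capacity across all users
QUEUE_TIMEOUT_S = 30          # how long a request may wait before giving up

gpu_slots = asyncio.Semaphore(MAX_CONCURRENT_GLOBAL)

class RateLimited(Exception):
    """Raised when a request should be rejected with HTTP 429."""

async def run_inference(prompt: str) -> str:
    try:
        # Throttle: wait for a free slot instead of rejecting immediately.
        await asyncio.wait_for(gpu_slots.acquire(), timeout=QUEUE_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Queue has been saturated for too long: fall back to a hard reject.
        raise RateLimited("server busy, retry later") from None
    try:
        return await call_llm_backend(prompt)   # hypothetical backend call
    finally:
        gpu_slots.release()

async def call_llm_backend(prompt: str) -> str:
    await asyncio.sleep(0.1)   # stand-in for the real LLM call
    return "generated text"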