LLM Rate Limiting & DoS Protection for AI Gateways
Ollama and LocalAI have no built-in rate limiting, and LiteLLM enforces none out of the box. A single unconstrained request can tie up a GPU for minutes and run up dollars in inference cost. Six protection layers, with ready-to-use configurations.
6 Rate Limiting Layers
Prevents brute-force prompt injection attempts and API abuse (a minimal token-bucket sketch follows these six layers)
limit: 60 req/min per user_id
burst: 10
algorithm: token_bucket
Controls LLM inference cost. Prevents one user from exhausting GPU capacity.
limit: 100,000 tokens/hour per user
daily_budget: 500,000 tokens
cost_cap: $5.00/day per user
Prevents resource exhaustion when users run many parallel agents
max_concurrent: 5 per user
max_concurrent_global: 100
queue_timeout: 30s
Prevents prompt stuffing attacks and runaway context costs
max_input_tokens: 32000
max_output_tokens: 4096
context_window_alert: 80%
Prevents infinite loops and runaway agent execution
max_tool_calls: 20 per run
max_recursion_depth: 5
timeout: 120s per run
Restricts GPT-4/Claude access to authorized users only
tier_free: [llama3.1-8b]
tier_pro: [llama3.1-70b, mistral-large]
tier_enterprise: [all]
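As a concrete illustration of the first layer, here is a minimal per-user token-bucket sketch in Python. It is a simplified in-memory version (a production gateway would typically back this with Redis); the Bucket class, allow_request helper, and refill math are illustrative assumptions, not Moltbot's actual implementation.

# token_bucket.py: illustrative per-user token bucket (Layer 1)
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    capacity: float = 10        # burst: 10
    refill_rate: float = 1.0    # 60 req/min -> 1 request token per second
    tokens: float = 10
    last_refill: float = field(default_factory=time.monotonic)

buckets: dict[str, Bucket] = {}   # keyed by user_id

def allow_request(user_id: str) -> bool:
    """Return True if this user may send another request right now."""
    bucket = buckets.setdefault(user_id, Bucket())
    now = time.monotonic()
    # Refill the tokens earned since the last check, capped at the burst size.
    bucket.tokens = min(bucket.capacity,
                        bucket.tokens + (now - bucket.last_refill) * bucket.refill_rate)
    bucket.last_refill = now
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    return False   # caller should respond with HTTP 429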
Nginx + Moltbot Gateway Configuration
# nginx.conf — Layer 1: Network-level rate limiting
http {
    limit_req_zone $binary_remote_addr zone=llm_api:10m rate=60r/m;
    limit_req_zone $binary_remote_addr zone=llm_heavy:10m rate=5r/m;

    server {
        location /v1/chat/completions {
            limit_req zone=llm_api burst=10 nodelay;
            proxy_pass http://moltbot:8080;
            proxy_read_timeout 300s;  # Long timeout for LLM generation
        }
    }
}
# moltbot.rate-limits.yaml — Layers 2-6: Token-aware limiting
rate_limits:
  per_user:
    requests_per_minute: 60
    tokens_per_hour: 100000
    tokens_per_day: 500000
    max_concurrent: 5
    max_input_tokens: 32000
    max_output_tokens: 4096
    max_tool_calls_per_run: 20
    run_timeout_seconds: 120
  model_tiers:
    free: ["llama3.1:8b", "mistral:7b"]
    pro: ["llama3.1:70b", "mixtral:8x7b"]
    enterprise: ["*"]
  cost_caps:
    currency: USD
    per_user_daily: 5.00
    global_hourly: 50.00
    alert_threshold: 0.80   # Alert at 80% of cap
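To show how a gateway might apply the model_tiers section of this file, here is a small sketch using PyYAML and glob-style matching. The model_allowed helper and the file path are hypothetical, for illustration only, and are not part of Moltbot.

# model tier check: illustrative sketch of applying the model_tiers config above
import fnmatch
import yaml   # pip install pyyaml

with open("moltbot.rate-limits.yaml") as f:
    config = yaml.safe_load(f)

TIERS = config["rate_limits"]["model_tiers"]   # {"free": [...], "pro": [...], "enterprise": ["*"]}

def model_allowed(user_tier: str, model: str) -> bool:
    """True if the user's tier grants access to the requested model ('*' matches any)."""
    patterns = TIERS.get(user_tier, [])
    return any(fnmatch.fnmatch(model, pattern) for pattern in patterns)

# model_allowed("free", "llama3.1:8b")   -> True
# model_allowed("free", "llama3.1:70b")  -> False
# model_allowed("enterprise", "gpt-4o")  -> True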
Frequently Asked Questions
Why is LLM rate limiting harder than traditional API rate limiting?
Traditional API rate limiting counts requests. LLM rate limiting must count tokens, because a single request can consume anywhere from 10 to 128,000 tokens. A user making 10 requests/minute at 10,000 tokens each consumes 100,000 tokens/minute, ten times more than a user making 100 requests/minute at 100 tokens each, while looking ten times cheaper to a request counter. Request-based limits miss this completely. Additionally, LLM inference is GPU-bound and expensive: a single unconstrained request can take minutes and cost dollars, so rate limiting must protect both server capacity and cost budget simultaneously.
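A tiny worked example of the same point, with illustrative numbers and limits (nothing here is Moltbot-specific): the request-count check waves the expensive user through, while a token-count check catches them.

# request-count vs token-count check: illustrative only
REQUESTS_PER_MINUTE_LIMIT = 60
TOKENS_PER_MINUTE_LIMIT = 20_000

heavy_user = {"requests_last_minute": 10, "tokens_last_minute": 10 * 10_000}   # 100,000 tokens
chatty_user = {"requests_last_minute": 100, "tokens_last_minute": 100 * 100}   # 10,000 tokens

for name, usage in [("heavy", heavy_user), ("chatty", chatty_user)]:
    ok_by_requests = usage["requests_last_minute"] <= REQUESTS_PER_MINUTE_LIMIT
    ok_by_tokens = usage["tokens_last_minute"] <= TOKENS_PER_MINUTE_LIMIT
    print(name, "passes request limit:", ok_by_requests, "| passes token limit:", ok_by_tokens)

# heavy  passes request limit: True  | passes token limit: False
# chatty passes request limit: False | passes token limit: True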
How does token budget enforcement work in practice?
Moltbot enforces token budgets in three phases. 1) Pre-request: estimate the token count from the input (exact for already-tokenized input, approximate for natural language); if the estimate exceeds the remaining budget, reject before sending anything to the LLM. 2) During generation: streaming responses count output tokens in real time; if the output budget is exceeded, generation stops and a partial response is returned. 3) Post-request: the actual token counts reported by the LLM API are recorded and deducted from the budget. Budgets persist in Redis with a sliding window (per hour) and a fixed window (per day). Budget refresh is configurable: hourly reset, daily reset, or monthly allocation.
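A simplified sketch of the pre-request check and post-request deduction, assuming redis-py and a fixed daily window only (the hourly sliding window and the mid-stream cutoff are omitted). The key names and the rough 4-characters-per-token estimate are illustrative assumptions, not Moltbot's actual schema.

# daily token budget in Redis: simplified sketch, not Moltbot's actual implementation
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
DAILY_BUDGET = 500_000   # tokens_per_day

def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(prompt) // 4)

def check_budget(user_id: str, prompt: str) -> bool:
    """Reject before calling the LLM if the estimate would exceed the remaining budget."""
    used = int(r.get(f"budget:daily:{user_id}") or 0)
    return used + estimate_tokens(prompt) <= DAILY_BUDGET

def record_usage(user_id: str, actual_tokens: int) -> None:
    """Deduct the real token count reported by the LLM API after the request."""
    key = f"budget:daily:{user_id}"
    r.incrby(key, actual_tokens)
    if r.ttl(key) < 0:          # first write in this window: start the 24h clock
        r.expire(key, 86_400)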
How do I protect a self-hosted Ollama instance from DoS?
Ollama has no built-in rate limiting or authentication. To protect a self-hosted Ollama instance: 1) Never expose Ollama directly (port 11434) to the internet. 2) Put Moltbot or LiteLLM in front of Ollama as a gateway, so all requests flow through it. 3) Configure rate limits on the gateway, not on Ollama. 4) Add nginx rate limiting as an additional layer: limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/m. 5) Use Tailscale or WireGuard so that only authorized clients can reach Ollama over the network.
What is the difference between rate limiting and throttling for LLMs?
Rate limiting hard-rejects requests that exceed the limit (HTTP 429 Too Many Requests). Throttling queues requests and delays them until capacity is available. For LLMs, use rate limiting for per-user token budgets (a hard stop prevents cost overrun) and throttling for global GPU capacity management (queueing rather than rejecting gives better UX when the server is busy but not being abused). Moltbot supports both: per-user rate limiting (hard) plus a global queue with a maximum depth and timeout (soft throttling).
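A minimal sketch of the soft-throttling side, assuming an asyncio-based gateway: a global semaphore caps concurrent inference, waiting requests queue up to a timeout, and only after that does the gateway fall back to a hard rejection. The names, numbers, and the call_llm_backend stub are illustrative, not Moltbot's API.

# global soft throttling with a bounded wait: illustrative sketch
import asyncio

MAX_CONCURRENT_GLOBAL = 100   # GPU capacity across all users
QUEUE_TIMEOUT_S = 30          # how long a request may wait before giving up

gpu_slots = asyncio.Semaphore(MAX_CONCURRENT_GLOBAL)

class RateLimited(Exception):
    """Raised when a request should be rejected with HTTP 429."""

async def run_inference(prompt: str) -> str:
    try:
        # Throttle: wait for a free slot instead of rejecting immediately.
        await asyncio.wait_for(gpu_slots.acquire(), timeout=QUEUE_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Queue has been saturated for too long: fall back to a hard reject.
        raise RateLimited("server busy, retry later") from None
    try:
        return await call_llm_backend(prompt)   # hypothetical backend call
    finally:
        gpu_slots.release()

async def call_llm_backend(prompt: str) -> str:
    await asyncio.sleep(0.1)   # stand-in for the real LLM call
    return "generated text"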