LLM Observability: Monitoring & Tracing for AI Agents
LLMs are non-deterministic — classical APM tools fail. Moltbot delivers complete observability: prompt traces, quality metrics, security events and cost tracking — fully self-hosted.
What is LLM Observability? Simply Explained
LLM observability is the practice of monitoring non-deterministic AI systems across four dimensions. Performance metrics such as latency (P50/P95/P99), time to first token, and tokens per second measure model throughput. Cost metrics such as token usage, cost per request, and cache hit rate keep budgets under control. Quality and security metrics such as hallucination rate, refusal rate, injection detection rate, and PII exposure rate protect against quality degradation and attacks. Finally, prompt traces create a complete causal chain from user request to response. Without observability, an AI system is a black box.
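As a concrete illustration of the latency metrics above, percentiles can be computed from raw request durations with a nearest-rank calculation. This is a minimal sketch; the function name and sample numbers are illustrative, not part of Moltbot's API:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ranked = sorted(samples)
    # nearest-rank method: take the ceil(p/100 * n)-th smallest sample (1-indexed)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

# Ten example request durations in seconds, with two slow outliers
latencies = [0.42, 0.51, 0.48, 1.93, 0.55, 0.60, 2.10, 0.47, 0.52, 0.49]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies, p):.2f}s")
```

Note how P95/P99 surface the outliers that a mean would hide, which is why tail percentiles are the standard alerting signal.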
Key Metrics
Performance
Request latency (P50/P95/P99): end-to-end response time per model and agent. Alert on regressions.
Time to first token: streaming latency, i.e. how long until the first token arrives at the client.
Tokens per second: model throughput. Critical for capacity planning and SLAs.
Context window utilization: % of the context window used per request. Alert at 80%+, where quality degrades.
Cost
Token usage: input + output tokens per request, agent, user, and time period.
Cost per request: calculated cost based on model pricing. Budget alerts per team/project.
Cost per task: business-level metric tracking cost per successful task completion.
Cache hit rate: semantic cache hits. Higher = lower cost. Track per prompt template.
Quality & Security
Hallucination rate: % of responses flagged by the factual-consistency checker. Track per model version.
Refusal rate: % of requests refused by the model. A spike signals a prompt-engineering issue or an injection attempt.
Injection detection rate: % of inputs flagged as potential prompt injection. A spike signals an active attack.
PII exposure rate: % of responses containing PII before redaction. Must be 0% in production.
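The cost-per-request metric above is a straightforward derivation from token counts and model pricing. A minimal sketch, assuming an illustrative pricing table (the prices and model key below are examples, not Moltbot configuration):

```python
# Illustrative per-1K-token prices in USD; real prices vary by model and provider.
PRICING_USD_PER_1K = {
    "example-model": {"input": 0.0025, "output": 0.01},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of a single LLM call from its token counts and the pricing table."""
    p = PRICING_USD_PER_1K[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

# 1200 input tokens and 400 output tokens at the example prices
cost = request_cost("example-model", input_tokens=1200, output_tokens=400)
```

Aggregating this value per agent, user, and time period yields the budget views described above; dividing by successful task completions yields cost per task.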
Prometheus Integration
# Moltbot exposes Prometheus metrics at /metrics
# prometheus.yml scrape config:
scrape_configs:
  - job_name: moltbot_llm
    static_configs:
      - targets: ['moltbot:9090']
    metrics_path: /metrics
# Key metrics exposed:
# moltbot_llm_request_duration_seconds{model, agent, status}
# moltbot_llm_tokens_total{model, type} # type: input|output
# moltbot_llm_cost_usd_total{model, agent}
# moltbot_security_injections_detected_total
# moltbot_security_pii_redactions_total
# moltbot_agent_tool_calls_total{tool, agent, status}
# moltbot_hitl_pending_approvals
# Grafana dashboard import:
# ClawGuru LLM Dashboard ID: 21847 (grafana.com/dashboards)

Frequently Asked Questions
Why is LLM observability different from traditional APM?
Traditional APM measures deterministic systems: same input → same output → same latency. LLMs are stochastic: same input can produce different outputs with different quality levels. This requires new metrics: hallucination rate (did the model make up facts?), refusal rate (is the model refusing valid requests?), semantic similarity (is the output meaningfully different from last week?). Traditional APM tools miss all of these. Moltbot's LLM observability layer was built specifically for probabilistic AI systems.
How does prompt tracing work?
Every LLM call is recorded with: input prompt (hashed + optionally stored), system message, model parameters (temperature, top_p, max_tokens), output tokens generated, latency breakdown (time to first token, generation time), tool calls made, security scan results, and a unique trace ID that links parent agent calls to child LLM calls. This creates a complete causal trace from user request → agent decision → LLM call → tool execution → response.
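The parent/child linkage described above can be pictured as a trace record carrying its own ID plus an optional parent ID. This is a simplified sketch of such a record; the class, field names, and sample values are illustrative, not Moltbot's actual schema:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMTrace:
    prompt_hash: str                 # hashed input prompt
    model: str
    temperature: float
    input_tokens: int
    output_tokens: int
    time_to_first_token_ms: float
    generation_ms: float
    parent_trace_id: Optional[str] = None   # links a child LLM call to its agent span
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

# An agent span, and an LLM call spawned by it (values are made up)
agent_span = LLMTrace("3f9c", "local-model", 0.2, 850, 120, 180.0, 950.0)
llm_span = LLMTrace("8a1d", "local-model", 0.2, 700, 90, 150.0, 800.0,
                    parent_trace_id=agent_span.trace_id)
```

Walking the `parent_trace_id` chain upward reconstructs the causal path from response back to the originating user request.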
How do I detect LLM quality regressions?
Moltbot supports three quality regression detection methods: 1) Automated evals — run a fixed test set against every model/prompt change, compare output similarity to golden set. 2) Statistical process control — flag when hallucination rate or refusal rate exceeds 2-sigma from baseline. 3) User feedback correlation — link thumbs down / escalations to specific prompt versions and model settings. Any of these triggers a regression alert in your monitoring dashboard.
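The statistical-process-control method (option 2 above) reduces to a simple check: flag the current window when its rate exceeds the baseline mean by more than two standard deviations. A minimal sketch, with made-up baseline hallucination rates:

```python
from statistics import mean, stdev

def spc_flag(baseline_rates, current_rate, sigmas=2.0):
    """Flag a regression when current_rate exceeds baseline mean + N sigma."""
    mu = mean(baseline_rates)
    sd = stdev(baseline_rates)
    return current_rate > mu + sigmas * sd

# Hallucination rates from seven healthy baseline windows (illustrative)
baseline = [0.021, 0.019, 0.023, 0.020, 0.022, 0.018, 0.021]
spc_flag(baseline, 0.020)  # inside the control band
spc_flag(baseline, 0.045)  # above mean + 2 sigma: regression alert
```

The same check applies unchanged to refusal rate or any other rate metric with a stable baseline.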
Can I run LLM observability without sending data to the cloud?
Yes — this is Moltbot's primary value proposition. All traces, metrics and logs are stored locally in your infrastructure (ClickHouse or PostgreSQL). The observability dashboard runs as a self-hosted web app. No data leaves your network. For air-gapped or high-security environments, Moltbot supports offline mode where even model calls go to local Ollama/LocalAI — full observability with zero external dependencies.