# LLM Observability

The `@hitler/prompts` package provides comprehensive observability for LLM interactions, including tracing, metrics collection, cost tracking, and A/B testing for prompt optimization.
## Why Observability Matters

- **Cost Control**: Track token usage and costs across models to optimize spending
- **Performance Monitoring**: Monitor latency percentiles and identify bottlenecks
- **Quality Assurance**: A/B test prompt variations to improve response quality
- **Debugging**: Trace individual requests through the system
## LLM Tracer

The `LLMTracer` class provides detailed tracing for all LLM calls.

### Basic Usage

```typescript
import { LLMTracer, getLLMTracer, withTracing } from "@hitler/prompts";

// Get the default tracer instance
const tracer = getLLMTracer();

// Start a trace manually
const trace = tracer.startTrace({
  model: "claude-3-5-sonnet-20241022",
  provider: "anthropic",
  callType: "chat",
  userId: "user-123",
  organizationId: "org-456",
  promptVersion: "v1.2.0",
});

// ... make LLM call ...

// Complete the trace with results
tracer.completeTrace(trace.traceId, {
  success: true,
  inputTokens: 150,
  outputTokens: 300,
});
```
### Using the withTracing Wrapper

For cleaner code, use the `withTracing` helper:

```typescript
import { withTracing } from "@hitler/prompts";

const result = await withTracing(
  {
    model: "claude-3-5-sonnet-20241022",
    provider: "anthropic",
    callType: "chat",
    userId: context.userId,
    organizationId: context.orgId,
  },
  async () => {
    const response = await anthropic.messages.create({ ... });
    return {
      result: response.content[0].text,
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
    };
  }
);
```
### Trace Properties

Each trace captures:

| Property | Type | Description |
| --- | --- | --- |
| `traceId` | `string` | Unique identifier for the trace |
| `spanId` | `string` | Span ID for distributed tracing |
| `parentSpanId` | `string?` | Parent span for nested calls |
| `startTime` | `Date` | When the call started |
| `endTime` | `Date` | When the call completed |
| `durationMs` | `number` | Total latency |
| `model` | `string` | Model used |
| `provider` | `string` | Provider (anthropic, openai) |
| `callType` | `string` | Type: chat, completion, embedding |
| `inputTokens` | `number` | Input token count |
| `outputTokens` | `number` | Output token count |
| `estimatedCostUsd` | `number` | Calculated cost |
| `success` | `boolean` | Whether call succeeded |
| `error` | `string?` | Error message if failed |
| `userId` | `string?` | User making request |
| `organizationId` | `string?` | Organization context |
| `promptVersion` | `string?` | Version of prompt used |
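In TypeScript terms, the table corresponds roughly to the following shape. This is a sketch derived from the property table; the type actually exported by `@hitler/prompts` may differ in detail:

```typescript
// Shape of a completed trace, following the property table above.
interface LLMTrace {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  startTime: Date;
  endTime: Date;
  durationMs: number;
  model: string;
  provider: string;
  callType: "chat" | "completion" | "embedding";
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
  success: boolean;
  error?: string;
  userId?: string;
  organizationId?: string;
  promptVersion?: string;
}

// Example: a completed, successful chat trace.
const example: LLMTrace = {
  traceId: "trace-abc",
  spanId: "span-1",
  startTime: new Date("2026-02-05T12:00:00.000Z"),
  endTime: new Date("2026-02-05T12:00:00.450Z"),
  durationMs: 450,
  model: "claude-3-5-sonnet-20241022",
  provider: "anthropic",
  callType: "chat",
  inputTokens: 150,
  outputTokens: 300,
  estimatedCostUsd: 0.00495,
  success: true,
  userId: "user-123",
};
```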
### Custom Tracer Configuration

```typescript
const tracer = new LLMTracer({
  // Custom cost config
  costConfig: {
    "claude-3-5-sonnet-20241022": {
      inputPer1kTokens: 0.003,
      outputPer1kTokens: 0.015,
    },
    "gpt-4o": {
      inputPer1kTokens: 0.005,
      outputPer1kTokens: 0.015,
    },
  },
  // Max traces to keep in memory
  maxTraces: 10000,
  // Callback when a trace completes
  onTraceComplete: async (trace) => {
    // Persist to database, send to analytics, etc.
    await persistTrace(trace);
  },
});
```
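The cost estimate itself is straightforward arithmetic over the configured per-1K-token rates. A minimal, self-contained sketch of that calculation (the helper name `estimateCostUsd` is illustrative, not the library API):

```typescript
interface ModelCost {
  inputPer1kTokens: number;
  outputPer1kTokens: number;
}

const costConfig: Record<string, ModelCost> = {
  "claude-3-5-sonnet-20241022": { inputPer1kTokens: 0.003, outputPer1kTokens: 0.015 },
  "gpt-4o": { inputPer1kTokens: 0.005, outputPer1kTokens: 0.015 },
};

// Estimate the USD cost of a call from its token counts.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rates = costConfig[model];
  if (!rates) return 0; // unknown model: report zero rather than guess
  return (
    (inputTokens / 1000) * rates.inputPer1kTokens +
    (outputTokens / 1000) * rates.outputPer1kTokens
  );
}
```

For example, a Sonnet call with 150 input and 300 output tokens works out to 0.15 × $0.003 + 0.3 × $0.015 ≈ $0.005.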
## Metrics Collection

### Getting Metrics

```typescript
const metrics = tracer.getMetrics(
  new Date("2026-02-01"), // Start time
  new Date("2026-02-05")  // End time
);
```
### Metrics Structure

```typescript
{
  period: { start: Date, end: Date },
  totalCalls: 15000,
  successfulCalls: 14850,
  failedCalls: 150,
  successRate: 0.99,
  latency: {
    avg: 450, // ms
    min: 120,
    max: 3200,
    p50: 380,
    p90: 750,
    p95: 1100,
    p99: 2100,
  },
  tokens: {
    totalInput: 2500000,
    totalOutput: 5000000,
    avgInput: 166,
    avgOutput: 333,
  },
  cost: {
    totalUsd: 125.50,
    avgPerCallUsd: 0.0084,
  },
  byModel: {
    "claude-3-5-sonnet-20241022": {
      calls: 10000,
      avgLatencyMs: 400,
      totalTokens: 5000000,
      totalCostUsd: 95.00,
    },
    "gpt-4o": {
      calls: 5000,
      avgLatencyMs: 550,
      totalTokens: 2500000,
      totalCostUsd: 30.50,
    },
  },
  byCallType: {
    chat: 14000,
    completion: 500,
    embedding: 500,
  },
  errors: {
    rate_limit: 100,
    timeout: 30,
    invalid_request: 20,
  },
}
```
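The latency percentiles above can be derived from raw trace durations with a nearest-rank percentile. A self-contained sketch of that computation (not the library's internal code):

```typescript
// Nearest-rank percentile: p in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Durations (ms) pulled from completed traces.
const latencies = [120, 180, 250, 380, 420, 450, 600, 750, 1100, 3200];

const summary = {
  p50: percentile(latencies, 50),
  p90: percentile(latencies, 90),
  p99: percentile(latencies, 99),
};
```

Nearest-rank is the simplest choice; interpolated percentiles (as in `PERCENTILE_CONT`) give smoother values on small samples.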
## Cost Tracking

Built-in cost configuration for common models:

| Model | Input $/1K | Output $/1K |
| --- | --- | --- |
| claude-3-opus-20240229 | $0.015 | $0.075 |
| claude-3-5-sonnet-20241022 | $0.003 | $0.015 |
| claude-3-haiku-20240307 | $0.00025 | $0.00125 |
| gpt-4-turbo | $0.01 | $0.03 |
| gpt-4o | $0.005 | $0.015 |
| gpt-4o-mini | $0.00015 | $0.0006 |
| gpt-3.5-turbo | $0.0005 | $0.0015 |
## A/B Testing Framework

Test different prompt variations to optimize response quality.

### Creating an Experiment

```typescript
import { ABTestingEngine, getABTestingEngine } from "@hitler/prompts";

const engine = getABTestingEngine();

// Create the experiment
const experiment = engine.createExperiment({
  id: "exp-intent-detection-v2",
  name: "Intent Detection Prompt V2",
  description: "Testing more detailed examples in intent detection",
  promptComponent: "intent",
  minSampleSize: 100,
  metrics: ["latency", "positive_feedback", "intent_accuracy"],
  variants: [
    {
      id: "control",
      name: "Current Prompt",
      content: currentIntentPrompt,
      trafficPercent: 50,
      active: true,
    },
    {
      id: "variant-a",
      name: "More Examples",
      content: newIntentPromptWithMoreExamples,
      trafficPercent: 50,
      active: true,
    },
  ],
});

// Start the experiment
engine.startExperiment("exp-intent-detection-v2");
```
### Getting a Variant for Users

```typescript
// Consistent assignment: the same user always gets the same variant
const variant = engine.getVariantForUser(experimentId, userId);
if (variant) {
  // Use variant.content as the prompt
  const prompt = variant.content;
}

// Or use the helper method
const { prompt, experimentId, variantId } = engine.getPromptWithExperiment(
  basePrompt,
  "intent",
  userId
);
```
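Consistent assignment is typically implemented by hashing the user ID into a fixed bucket range and mapping buckets onto the variants' traffic percentages. A self-contained sketch of one common approach (FNV-1a hashing is an assumption here; the library may hash differently):

```typescript
interface Variant {
  id: string;
  trafficPercent: number;
  active: boolean;
}

// FNV-1a 32-bit hash of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a user deterministically into [0, 100), then walk the cumulative traffic split.
function assignVariant(experimentId: string, userId: string, variants: Variant[]): Variant | null {
  const bucket = fnv1a(`${experimentId}:${userId}`) % 100;
  let cumulative = 0;
  for (const v of variants) {
    if (!v.active) continue;
    cumulative += v.trafficPercent;
    if (bucket < cumulative) return v;
  }
  return null; // traffic percentages did not cover this bucket
}
```

Because the bucket depends only on `experimentId` and `userId`, repeat calls for the same user are stable, which keeps each user's experience consistent for the life of the experiment.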
### Recording Results

```typescript
// Record a result after the LLM call
engine.recordResult({
  experimentId: "exp-intent-detection-v2",
  variantId: "variant-a",
  userId: "user-123",
  sessionId: "session-456",
  latencyMs: 450,
  metrics: {
    intent_confidence: 0.95,
    response_quality: 8,
  },
  timestamp: new Date(),
});

// Record user feedback later
engine.recordFeedback(
  experimentId,
  variantId,
  sessionId,
  "positive" // or "negative"
);
```
### Analyzing Results

```typescript
const stats = engine.getExperimentStats("exp-intent-detection-v2");

// Returns an array of VariantStats:
[
  {
    variantId: "control",
    variantName: "Current Prompt",
    sampleSize: 500,
    avgLatencyMs: 480,
    p95LatencyMs: 850,
    positiveFeedbackRate: 0.72,
    negativeFeedbackRate: 0.08,
    metricAverages: {
      intent_confidence: 0.88,
      response_quality: 7.2,
    },
  },
  {
    variantId: "variant-a",
    variantName: "More Examples",
    sampleSize: 500,
    avgLatencyMs: 520,
    p95LatencyMs: 920,
    positiveFeedbackRate: 0.81,
    negativeFeedbackRate: 0.05,
    metricAverages: {
      intent_confidence: 0.92,
      response_quality: 7.8,
    },
    significanceVsControl: {
      metric: "positive_feedback_rate",
      pValue: 0.02,
      significant: true, // p < 0.05
    },
  },
];
```
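A comparison like `significanceVsControl` on feedback rates can be approximated with a two-proportion z-test. The sketch below is self-contained and illustrative; the library's actual statistical method is not documented here:

```typescript
// Standard normal CDF via the Abramowitz-Stegun tail approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const tail =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - tail : tail;
}

// Two-sided two-proportion z-test: x successes out of n trials per group.
function twoProportionPValue(x1: number, n1: number, x2: number, n2: number): number {
  const p1 = x1 / n1;
  const p2 = x2 / n2;
  const pooled = (x1 + x2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  const z = Math.abs(p1 - p2) / se;
  return 2 * (1 - normalCdf(z));
}

// Variant A: 405/500 positive feedback vs control: 360/500.
const pValue = twoProportionPValue(405, 500, 360, 500);
const significant = pValue < 0.05;
```

With 500 samples per side, the 0.81 vs 0.72 split in the example above is comfortably significant; with much smaller samples the same gap would not be, which is why `minSampleSize` matters.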
### Experiment Lifecycle

```typescript
// Create in draft status
engine.createExperiment({ ... });

// Start running
engine.startExperiment(experimentId);

// Pause if needed
engine.pauseExperiment(experimentId);

// Complete when done
engine.completeExperiment(experimentId);
```
## Integration Example

Complete integration in a chat service:

```typescript
import {
  getLLMTracer,
  getABTestingEngine,
  getSecurityMonitor,
  assessThreatLevel,
  validateOutput,
  withTracing,
} from "@hitler/prompts";

class ChatService {
  private tracer = getLLMTracer();
  private abEngine = getABTestingEngine();
  private monitor = getSecurityMonitor();

  async chat(message: string, context: ChatContext) {
    // 1. Security check
    const threat = assessThreatLevel(message);
    this.monitor.recordEvent({
      organizationId: context.orgId,
      userId: context.userId,
      flagged: threat.shouldFlag,
      blocked: threat.shouldBlock,
      threats: threat.details.threats,
    });
    if (threat.shouldBlock) {
      return { text: "hey! need help with tasks?" };
    }

    // 2. Get prompt (with A/B testing)
    const { prompt, experimentId, variantId } = this.abEngine.getPromptWithExperiment(
      this.baseSystemPrompt,
      "system",
      context.userId
    );

    // 3. Make LLM call with tracing
    const startTime = Date.now();
    const result = await withTracing(
      {
        model: "claude-3-5-sonnet-20241022",
        provider: "anthropic",
        callType: "chat",
        userId: context.userId,
        organizationId: context.orgId,
        experimentVariant: variantId,
      },
      () => this.callLLM(prompt, threat.details.sanitized)
    );

    // 4. Record A/B test result
    if (experimentId && variantId) {
      this.abEngine.recordResult({
        experimentId,
        variantId,
        userId: context.userId,
        sessionId: context.sessionId,
        latencyMs: Date.now() - startTime,
        metrics: { response_length: result.length },
        timestamp: new Date(),
      });
    }

    // 5. Validate output
    const validation = validateOutput(result, context.sessionId);
    return { text: validation.safe ? result : validation.sanitized };
  }
}
```
## Dashboard Queries

### Cost by Organization

```sql
SELECT
  organization_id,
  SUM(estimated_cost_usd) AS total_cost,
  COUNT(*) AS total_calls,
  AVG(duration_ms) AS avg_latency
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY organization_id
ORDER BY total_cost DESC;
```
### Performance by Model

```sql
SELECT
  model,
  COUNT(*) AS calls,
  AVG(duration_ms) AS avg_latency,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_latency,
  SUM(input_tokens + output_tokens) AS total_tokens,
  AVG(CASE WHEN success THEN 1 ELSE 0 END) AS success_rate
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY model;
```
### A/B Test Winner Detection

```sql
WITH variant_stats AS (
  SELECT
    variant_id,
    COUNT(*) AS sample_size,
    AVG(CASE WHEN feedback = 'positive' THEN 1 ELSE 0 END) AS positive_rate,
    AVG(latency_ms) AS avg_latency
  FROM experiment_results
  WHERE experiment_id = 'exp-123'
  GROUP BY variant_id
)
SELECT
  *,
  CASE
    WHEN positive_rate > (SELECT positive_rate FROM variant_stats WHERE variant_id = 'control')
      THEN 'WINNER'
    ELSE 'LOSING'
  END AS status
FROM variant_stats;
```
## Best Practices

- **Always trace LLM calls**: Use the `withTracing` wrapper for all LLM interactions.
- **Set meaningful prompt versions**: Track which prompt version produced each response.
- **Run experiments with sufficient sample size**: At least 100 samples per variant for statistical significance.
- **Monitor costs regularly**: Set up alerts for unexpected cost increases.
- **Archive old traces**: Move traces older than 90 days to cold storage.
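The archival practice can be as simple as partitioning stored traces on a cutoff date before moving the old batch to cold storage. A sketch of that step (the storage calls are omitted, and the `createdAt` field name is an assumption about the persisted schema):

```typescript
interface StoredTrace {
  traceId: string;
  createdAt: Date;
}

const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

// Split traces into those to keep hot and those to move to cold storage.
function partitionForArchive(traces: StoredTrace[], now: Date = new Date()) {
  const cutoff = now.getTime() - NINETY_DAYS_MS;
  const hot: StoredTrace[] = [];
  const cold: StoredTrace[] = [];
  for (const t of traces) {
    (t.createdAt.getTime() >= cutoff ? hot : cold).push(t);
  }
  return { hot, cold };
}
```

A scheduled job would run this partition, bulk-write `cold` to the archive, and delete those rows from the hot store only after the write succeeds.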
## Notification Deduplication

The `TaskNotificationDedupService` (`apps/api/src/modules/jobs/task-notification-dedup.service.ts`) prevents duplicate notifications from being sent for the same event. It uses Redis-based dedup keys to ensure that repeated events (e.g., multiple overdue checks for the same task) generate only one notification within a configurable window. This is critical for cron-driven jobs that run frequently and may process the same tasks multiple times.
**Privacy note:** LLM traces may contain sensitive user input. Ensure proper access controls and consider truncating input in traces for privacy compliance.