
LLM Observability

The @hitler/prompts package provides comprehensive observability for LLM interactions, including tracing, metrics collection, cost tracking, and A/B testing for prompt optimization.

Why Observability Matters

Cost Control

Track token usage and costs across models to optimize spending

Performance Monitoring

Monitor latency percentiles and identify bottlenecks

Quality Assurance

A/B test prompt variations to improve response quality

Debugging

Trace individual requests through the system

LLM Tracer

The LLMTracer class provides detailed tracing for all LLM calls.

Basic Usage

import { LLMTracer, getLLMTracer, withTracing } from "@hitler/prompts";

// Get the default tracer instance
const tracer = getLLMTracer();

// Start a trace manually
const trace = tracer.startTrace({
  model: "claude-3-5-sonnet-20241022",
  provider: "anthropic",
  callType: "chat",
  userId: "user-123",
  organizationId: "org-456",
  promptVersion: "v1.2.0",
});

// ... make LLM call ...

// Complete the trace with results
tracer.completeTrace(trace.traceId, {
  success: true,
  inputTokens: 150,
  outputTokens: 300,
});

Using the withTracing Wrapper

For cleaner code, use the withTracing helper:

import { withTracing } from '@hitler/prompts';

const result = await withTracing(
  {
    model: 'claude-3-5-sonnet-20241022',
    provider: 'anthropic',
    callType: 'chat',
    userId: context.userId,
    organizationId: context.orgId,
  },
  async () => {
    const response = await anthropic.messages.create({ ... });
    return {
      result: response.content[0].text,
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
    };
  }
);

Trace Properties

Each trace captures:

| Property | Type | Description |
| --- | --- | --- |
| traceId | string | Unique identifier for the trace |
| spanId | string | Span ID for distributed tracing |
| parentSpanId | string? | Parent span for nested calls |
| startTime | Date | When the call started |
| endTime | Date | When the call completed |
| durationMs | number | Total latency in milliseconds |
| model | string | Model used |
| provider | string | Provider (anthropic, openai) |
| callType | string | One of chat, completion, embedding |
| inputTokens | number | Input token count |
| outputTokens | number | Output token count |
| estimatedCostUsd | number | Calculated cost |
| success | boolean | Whether the call succeeded |
| error | string? | Error message if the call failed |
| userId | string? | User making the request |
| organizationId | string? | Organization context |
| promptVersion | string? | Version of the prompt used |

Custom Tracer Configuration

const tracer = new LLMTracer({
  // Custom cost config
  costConfig: {
    "claude-3-5-sonnet-20241022": {
      inputPer1kTokens: 0.003,
      outputPer1kTokens: 0.015,
    },
    "gpt-4o": {
      inputPer1kTokens: 0.005,
      outputPer1kTokens: 0.015,
    },
  },
  // Max traces to keep in memory
  maxTraces: 10000,
  // Callback when trace completes
  onTraceComplete: async (trace) => {
    // Persist to database, send to analytics, etc.
    await persistTrace(trace);
  },
});

Metrics Collection

Getting Metrics

const metrics = tracer.getMetrics(
  new Date("2026-02-01"), // Start time
  new Date("2026-02-05") // End time
);

Metrics Structure

{
  period: { start: Date, end: Date },
  totalCalls: 15000,
  successfulCalls: 14850,
  failedCalls: 150,
  successRate: 0.99,

  latency: {
    avg: 450,      // ms
    min: 120,
    max: 3200,
    p50: 380,
    p90: 750,
    p95: 1100,
    p99: 2100,
  },

  tokens: {
    totalInput: 2500000,
    totalOutput: 5000000,
    avgInput: 166,
    avgOutput: 333,
  },

  cost: {
    totalUsd: 125.50,
    avgPerCallUsd: 0.0084,
  },

  byModel: {
    'claude-3-5-sonnet-20241022': {
      calls: 10000,
      avgLatencyMs: 400,
      totalTokens: 5000000,
      totalCostUsd: 95.00,
    },
    'gpt-4o': {
      calls: 5000,
      avgLatencyMs: 550,
      totalTokens: 2500000,
      totalCostUsd: 30.50,
    },
  },

  byCallType: {
    chat: 14000,
    completion: 500,
    embedding: 500,
  },

  errors: {
    'rate_limit': 100,
    'timeout': 30,
    'invalid_request': 20,
  },
}
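The latency percentiles in this structure are derived from the raw per-trace durations. A minimal sketch of how such percentiles can be computed (the nearest-rank method is an assumption here; the library may interpolate differently):

```typescript
// Nearest-rank percentile over a list of latency samples (ms).
// Illustrative sketch, not the library's internal code.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: ceil(p/100 * N), converted to a 0-based index
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [120, 200, 380, 450, 750, 1100, 2100, 3200];
console.log(percentile(latencies, 50)); // 450
console.log(percentile(latencies, 95)); // 3200
```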

Cost Tracking

Built-in cost configuration for common models:

| Model | Input $/1K | Output $/1K |
| --- | --- | --- |
| claude-3-opus-20240229 | $0.015 | $0.075 |
| claude-3-5-sonnet-20241022 | $0.003 | $0.015 |
| claude-3-haiku-20240307 | $0.00025 | $0.00125 |
| gpt-4-turbo | $0.01 | $0.03 |
| gpt-4o | $0.005 | $0.015 |
| gpt-4o-mini | $0.00015 | $0.0006 |
| gpt-3.5-turbo | $0.0005 | $0.0015 |
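Per-call cost follows directly from these per-1K rates. A hedged sketch of the arithmetic (the lookup structure mirrors the costConfig shape from the Custom Tracer Configuration example; it is illustrative, not the tracer's internal table):

```typescript
// Estimate cost from token counts using per-1K-token rates.
type Rate = { inputPer1kTokens: number; outputPer1kTokens: number };

const rates: Record<string, Rate> = {
  "claude-3-5-sonnet-20241022": { inputPer1kTokens: 0.003, outputPer1kTokens: 0.015 },
  "gpt-4o-mini": { inputPer1kTokens: 0.00015, outputPer1kTokens: 0.0006 },
};

function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rate = rates[model];
  if (!rate) return 0; // unknown models are priced at zero in this sketch
  return (
    (inputTokens / 1000) * rate.inputPer1kTokens +
    (outputTokens / 1000) * rate.outputPer1kTokens
  );
}

// 150 input + 300 output tokens on Claude 3.5 Sonnet:
console.log(estimateCostUsd("claude-3-5-sonnet-20241022", 150, 300)); // ≈ 0.00495
```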

A/B Testing Framework

Test different prompt variations to optimize response quality.

Creating an Experiment

import { ABTestingEngine, getABTestingEngine } from "@hitler/prompts";

const engine = getABTestingEngine();

// Create experiment
const experiment = engine.createExperiment({
  id: "exp-intent-detection-v2",
  name: "Intent Detection Prompt V2",
  description: "Testing more detailed examples in intent detection",
  promptComponent: "intent",
  minSampleSize: 100,
  metrics: ["latency", "positive_feedback", "intent_accuracy"],
  variants: [
    {
      id: "control",
      name: "Current Prompt",
      content: currentIntentPrompt,
      trafficPercent: 50,
      active: true,
    },
    {
      id: "variant-a",
      name: "More Examples",
      content: newIntentPromptWithMoreExamples,
      trafficPercent: 50,
      active: true,
    },
  ],
});

// Start the experiment
engine.startExperiment("exp-intent-detection-v2");

Getting Variant for Users

// Consistent assignment - same user always gets same variant
const variant = engine.getVariantForUser(experimentId, userId);

if (variant) {
  // Use variant.content as the prompt
  const prompt = variant.content;
}

// Or use the helper method
const { prompt, experimentId, variantId } = engine.getPromptWithExperiment(
  basePrompt,
  "intent",
  userId
);
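Consistent assignment is typically implemented by hashing the user ID into a 0-99 bucket and walking the variants' cumulative traffic percentages. The engine's actual hash is an internal detail; this sketch uses a simple FNV-1a hash to show the idea:

```typescript
interface Variant { id: string; trafficPercent: number; active: boolean }

// FNV-1a hash of a string, reduced to a bucket in [0, 100).
function bucketFor(userId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 100;
}

// Walk cumulative traffic percentages to pick a variant deterministically.
function assignVariant(variants: Variant[], userId: string): Variant | null {
  const bucket = bucketFor(userId);
  let cumulative = 0;
  for (const v of variants) {
    if (!v.active) continue;
    cumulative += v.trafficPercent;
    if (bucket < cumulative) return v;
  }
  return null; // traffic percentages summed to less than 100
}

const variants: Variant[] = [
  { id: "control", trafficPercent: 50, active: true },
  { id: "variant-a", trafficPercent: 50, active: true },
];

// The hash is stable, so the same user always gets the same variant.
const first = assignVariant(variants, "user-123");
const second = assignVariant(variants, "user-123");
console.log(first?.id === second?.id); // true
```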

Recording Results

// Record result after LLM call
engine.recordResult({
  experimentId: "exp-intent-detection-v2",
  variantId: "variant-a",
  userId: "user-123",
  sessionId: "session-456",
  latencyMs: 450,
  metrics: {
    intent_confidence: 0.95,
    response_quality: 8,
  },
  timestamp: new Date(),
});

// Record user feedback later
engine.recordFeedback(
  experimentId,
  variantId,
  sessionId,
  "positive" // or 'negative'
);

Analyzing Results

const stats = engine.getExperimentStats("exp-intent-detection-v2");

// Returns array of VariantStats:
[
  {
    variantId: "control",
    variantName: "Current Prompt",
    sampleSize: 500,
    avgLatencyMs: 480,
    p95LatencyMs: 850,
    positiveFeedbackRate: 0.72,
    negativeFeedbackRate: 0.08,
    metricAverages: {
      intent_confidence: 0.88,
      response_quality: 7.2,
    },
  },
  {
    variantId: "variant-a",
    variantName: "More Examples",
    sampleSize: 500,
    avgLatencyMs: 520,
    p95LatencyMs: 920,
    positiveFeedbackRate: 0.81,
    negativeFeedbackRate: 0.05,
    metricAverages: {
      intent_confidence: 0.92,
      response_quality: 7.8,
    },
    significanceVsControl: {
      metric: "positive_feedback_rate",
      pValue: 0.02,
      significant: true, // p < 0.05
    },
  },
];
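The significanceVsControl field reports a p-value for the difference against the control. A common choice for comparing feedback rates is a two-proportion z-test; whether the engine uses exactly this test is an assumption, but the sketch shows how such a p-value can be derived:

```typescript
// Two-proportion z-test for positive-feedback rates.
// Illustrative only; the engine's actual statistical test may differ.
function twoProportionPValue(
  successesA: number, totalA: number,
  successesB: number, totalB: number
): number {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = Math.abs(pA - pB) / se;
  return 2 * (1 - normalCdf(z)); // two-sided p-value
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation.
function normalCdf(x: number): number {
  const t = 1 / (1 + 0.3275911 * (Math.abs(x) / Math.SQRT2));
  const erf =
    1 -
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
      t * (-1.453152027 + t * 1.061405429)))) * Math.exp(-(x * x) / 2);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Control: 360/500 positive (0.72); variant: 405/500 positive (0.81)
const p = twoProportionPValue(360, 500, 405, 500);
console.log(p < 0.05); // true: the difference is significant
```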

Experiment Lifecycle

// Create in draft status
engine.createExperiment({ ... });

// Start running
engine.startExperiment(experimentId);

// Pause if needed
engine.pauseExperiment(experimentId);

// Complete when done
engine.completeExperiment(experimentId);
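The lifecycle implies a small state machine. The exact transition rules are an assumption inferred from the method names above, but a sketch of the allowed transitions:

```typescript
type ExperimentStatus = "draft" | "running" | "paused" | "completed";

// Assumed transitions: draft -> running, running <-> paused,
// running/paused -> completed. Completed is terminal.
const allowedTransitions: Record<ExperimentStatus, ExperimentStatus[]> = {
  draft: ["running"],
  running: ["paused", "completed"],
  paused: ["running", "completed"],
  completed: [],
};

function canTransition(from: ExperimentStatus, to: ExperimentStatus): boolean {
  return allowedTransitions[from].includes(to);
}

console.log(canTransition("draft", "running"));     // true
console.log(canTransition("completed", "running")); // false
```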

Integration Example

Complete integration in a chat service:

import {
  getLLMTracer,
  getABTestingEngine,
  getSecurityMonitor,
  assessThreatLevel,
  validateOutput,
  withTracing,
} from "@hitler/prompts";

class ChatService {
  private tracer = getLLMTracer();
  private abEngine = getABTestingEngine();
  private monitor = getSecurityMonitor();

  async chat(message: string, context: ChatContext) {
    // 1. Security check
    const threat = assessThreatLevel(message);
    this.monitor.recordEvent({
      organizationId: context.orgId,
      userId: context.userId,
      flagged: threat.shouldFlag,
      blocked: threat.shouldBlock,
      threats: threat.details.threats,
    });

    if (threat.shouldBlock) {
      return { text: "hey! need help with tasks?" };
    }

    // 2. Get prompt (with A/B testing)
    const { prompt, experimentId, variantId } = this.abEngine.getPromptWithExperiment(
      this.baseSystemPrompt,
      "system",
      context.userId
    );

    // 3. Make LLM call with tracing
    const startTime = Date.now();
    const result = await withTracing(
      {
        model: "claude-3-5-sonnet-20241022",
        provider: "anthropic",
        callType: "chat",
        userId: context.userId,
        organizationId: context.orgId,
        experimentVariant: variantId,
      },
      () => this.callLLM(prompt, threat.details.sanitized)
    );

    // 4. Record A/B test result
    if (experimentId && variantId) {
      this.abEngine.recordResult({
        experimentId,
        variantId,
        userId: context.userId,
        sessionId: context.sessionId,
        latencyMs: Date.now() - startTime,
        metrics: { response_length: result.length },
        timestamp: new Date(),
      });
    }

    // 5. Validate output
    const validation = validateOutput(result, context.sessionId);

    return { text: validation.safe ? result : validation.sanitized };
  }
}

Dashboard Queries

Cost by Organization

SELECT
  organization_id,
  SUM(estimated_cost_usd) as total_cost,
  COUNT(*) as total_calls,
  AVG(duration_ms) as avg_latency
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY organization_id
ORDER BY total_cost DESC;

Model Performance Comparison

SELECT
  model,
  COUNT(*) as calls,
  AVG(duration_ms) as avg_latency,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_latency,
  SUM(input_tokens + output_tokens) as total_tokens,
  AVG(CASE WHEN success THEN 1 ELSE 0 END) as success_rate
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY model;

A/B Test Winner Detection

WITH variant_stats AS (
  SELECT
    variant_id,
    COUNT(*) as sample_size,
    AVG(CASE WHEN feedback = 'positive' THEN 1 ELSE 0 END) as positive_rate,
    AVG(latency_ms) as avg_latency
  FROM experiment_results
  WHERE experiment_id = 'exp-123'
  GROUP BY variant_id
)
SELECT
  *,
  CASE
    WHEN variant_id = 'control' THEN 'BASELINE'
    WHEN positive_rate > (SELECT positive_rate FROM variant_stats WHERE variant_id = 'control')
    THEN 'WINNER'
    ELSE 'LOSING'
  END as status
FROM variant_stats;

Best Practices

1. Always trace LLM calls. Use the withTracing wrapper for all LLM interactions.

2. Set meaningful prompt versions. Track which prompt version produced each response.

3. Run experiments with sufficient sample size. Collect at least 100 samples per variant before judging statistical significance.

4. Monitor costs regularly. Set up alerts for unexpected cost increases.

5. Archive old traces. Move traces older than 90 days to cold storage.

Notification Deduplication

The TaskNotificationDedupService (apps/api/src/modules/jobs/task-notification-dedup.service.ts) prevents duplicate notifications from being sent for the same event. It uses Redis-based dedup keys to ensure that repeated events (e.g., multiple overdue checks for the same task) only generate one notification within a configurable window. This is critical for cron-driven jobs that run frequently and may process the same tasks multiple times.
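The core of this dedup logic is an atomic "set if not exists with TTL" check. The real service keys off Redis; the in-memory sketch below models the same windowed behavior (the key shape and window length are illustrative, not the service's actual values):

```typescript
// In-memory model of windowed notification dedup.
// Production code would use Redis SET <key> NX EX <windowSeconds> instead.
class NotificationDedup {
  private seen = new Map<string, number>(); // key -> expiry timestamp (ms)

  constructor(private windowMs: number) {}

  // Returns true if the notification should be sent (first occurrence
  // within the window), false if it is a duplicate.
  shouldSend(taskId: string, eventType: string, now = Date.now()): boolean {
    const key = `notif:${eventType}:${taskId}`; // illustrative key shape
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.seen.set(key, now + this.windowMs);
    return true;
  }
}

const dedup = new NotificationDedup(60 * 60 * 1000); // 1-hour window
console.log(dedup.shouldSend("task-1", "overdue")); // true: first event
console.log(dedup.shouldSend("task-1", "overdue")); // false: suppressed
```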
Privacy Considerations

LLM traces may contain sensitive user input. Ensure proper access controls and consider truncating inputs in traces for privacy compliance.