# LLM Observability

The `@hitler/prompts` package provides comprehensive observability for LLM interactions, including tracing, metrics collection, cost tracking, and A/B testing for prompt optimization.
## Why Observability Matters

- **Cost Control**: Track token usage and costs across models to optimize spending
- **Performance Monitoring**: Monitor latency percentiles and identify bottlenecks
- **Quality Assurance**: A/B test prompt variations to improve response quality
- **Debugging**: Trace individual requests through the system
## LLM Tracer

The `LLMTracer` class provides detailed tracing for all LLM calls.

### Basic Usage

```typescript
import { LLMTracer, getLLMTracer, withTracing } from "@hitler/prompts";

// Get the default tracer instance
const tracer = getLLMTracer();

// Start a trace manually
const trace = tracer.startTrace({
  model: "claude-3-5-sonnet-20241022",
  provider: "anthropic",
  callType: "chat",
  userId: "user-123",
  organizationId: "org-456",
  promptVersion: "v1.2.0",
});

// ... make LLM call ...

// Complete the trace with results
tracer.completeTrace(trace.traceId, {
  success: true,
  inputTokens: 150,
  outputTokens: 300,
});
```
### Using the withTracing Wrapper

For cleaner code, use the `withTracing` helper:

```typescript
import { withTracing } from "@hitler/prompts";

const result = await withTracing(
  {
    model: "claude-3-5-sonnet-20241022",
    provider: "anthropic",
    callType: "chat",
    userId: context.userId,
    organizationId: context.orgId,
  },
  async () => {
    const response = await anthropic.messages.create({ ... });
    return {
      result: response.content[0].text,
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
    };
  }
);
```
### Trace Properties

Each trace captures:

| Property | Type | Description |
| --- | --- | --- |
| `traceId` | `string` | Unique identifier for the trace |
| `spanId` | `string` | Span ID for distributed tracing |
| `parentSpanId` | `string?` | Parent span for nested calls |
| `startTime` | `Date` | When the call started |
| `endTime` | `Date` | When the call completed |
| `durationMs` | `number` | Total latency |
| `model` | `string` | Model used |
| `provider` | `string` | Provider (anthropic, openai) |
| `callType` | `string` | Type: chat, completion, embedding |
| `inputTokens` | `number` | Input token count |
| `outputTokens` | `number` | Output token count |
| `estimatedCostUsd` | `number` | Calculated cost |
| `success` | `boolean` | Whether call succeeded |
| `error` | `string?` | Error message if failed |
| `userId` | `string?` | User making request |
| `organizationId` | `string?` | Organization context |
| `promptVersion` | `string?` | Version of prompt used |
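In TypeScript terms, the table corresponds roughly to the following shape. This is a sketch derived from the property table; the type actually exported by `@hitler/prompts` may differ in detail:

```typescript
// Shape of a completed trace, following the property table above.
interface LLMTrace {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  startTime: Date;
  endTime: Date;
  durationMs: number;
  model: string;
  provider: string;
  callType: "chat" | "completion" | "embedding";
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
  success: boolean;
  error?: string;
  userId?: string;
  organizationId?: string;
  promptVersion?: string;
}

// Example: a completed, successful chat trace.
const example: LLMTrace = {
  traceId: "trace-abc",
  spanId: "span-1",
  startTime: new Date("2026-02-05T12:00:00.000Z"),
  endTime: new Date("2026-02-05T12:00:00.450Z"),
  durationMs: 450,
  model: "claude-3-5-sonnet-20241022",
  provider: "anthropic",
  callType: "chat",
  inputTokens: 150,
  outputTokens: 300,
  estimatedCostUsd: 0.00495,
  success: true,
  userId: "user-123",
};
```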
### Custom Tracer Configuration

```typescript
const tracer = new LLMTracer({
  // Custom cost config
  costConfig: {
    "claude-3-5-sonnet-20241022": {
      inputPer1kTokens: 0.003,
      outputPer1kTokens: 0.015,
    },
    "gpt-4o": {
      inputPer1kTokens: 0.005,
      outputPer1kTokens: 0.015,
    },
  },
  // Max traces to keep in memory
  maxTraces: 10000,
  // Callback when a trace completes
  onTraceComplete: async (trace) => {
    // Persist to database, send to analytics, etc.
    await persistTrace(trace);
  },
});
```
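The cost estimate itself is straightforward arithmetic over the configured per-1K-token rates. A minimal, self-contained sketch of that calculation (the helper name `estimateCostUsd` is illustrative, not the library API):

```typescript
interface ModelCost {
  inputPer1kTokens: number;
  outputPer1kTokens: number;
}

const costConfig: Record<string, ModelCost> = {
  "claude-3-5-sonnet-20241022": { inputPer1kTokens: 0.003, outputPer1kTokens: 0.015 },
  "gpt-4o": { inputPer1kTokens: 0.005, outputPer1kTokens: 0.015 },
};

// Estimate the USD cost of a call from its token counts.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rates = costConfig[model];
  if (!rates) return 0; // unknown model: report zero rather than guess
  return (
    (inputTokens / 1000) * rates.inputPer1kTokens +
    (outputTokens / 1000) * rates.outputPer1kTokens
  );
}
```

For example, a Sonnet call with 150 input and 300 output tokens works out to 0.15 × $0.003 + 0.3 × $0.015 ≈ $0.005.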
## Metrics Collection

### Getting Metrics

```typescript
const metrics = tracer.getMetrics(
  new Date("2026-02-01"), // Start time
  new Date("2026-02-05")  // End time
);
```
### Metrics Structure

```typescript
{
  period: { start: Date, end: Date },
  totalCalls: 15000,
  successfulCalls: 14850,
  failedCalls: 150,
  successRate: 0.99,
  latency: {
    avg: 450, // ms
    min: 120,
    max: 3200,
    p50: 380,
    p90: 750,
    p95: 1100,
    p99: 2100,
  },
  tokens: {
    totalInput: 2500000,
    totalOutput: 5000000,
    avgInput: 166,
    avgOutput: 333,
  },
  cost: {
    totalUsd: 125.50,
    avgPerCallUsd: 0.0084,
  },
  byModel: {
    "claude-3-5-sonnet-20241022": {
      calls: 10000,
      avgLatencyMs: 400,
      totalTokens: 5000000,
      totalCostUsd: 95.00,
    },
    "gpt-4o": {
      calls: 5000,
      avgLatencyMs: 550,
      totalTokens: 2500000,
      totalCostUsd: 30.50,
    },
  },
  byCallType: {
    chat: 14000,
    completion: 500,
    embedding: 500,
  },
  errors: {
    rate_limit: 100,
    timeout: 30,
    invalid_request: 20,
  },
}
```
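The latency percentiles above can be derived from raw trace durations with a nearest-rank percentile. A self-contained sketch of that computation (not the library's internal code):

```typescript
// Nearest-rank percentile: p in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Durations (ms) pulled from completed traces.
const latencies = [120, 180, 250, 380, 420, 450, 600, 750, 1100, 3200];

const summary = {
  p50: percentile(latencies, 50),
  p90: percentile(latencies, 90),
  p99: percentile(latencies, 99),
};
```

Nearest-rank is the simplest choice; interpolated percentiles (as in `PERCENTILE_CONT`) give smoother values on small samples.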
## Cost Tracking

Built-in cost configuration for common models:

| Model | Input $/1K | Output $/1K |
| --- | --- | --- |
| claude-3-opus-20240229 | $0.015 | $0.075 |
| claude-3-5-sonnet-20241022 | $0.003 | $0.015 |
| claude-3-haiku-20240307 | $0.00025 | $0.00125 |
| gpt-4-turbo | $0.01 | $0.03 |
| gpt-4o | $0.005 | $0.015 |
| gpt-4o-mini | $0.00015 | $0.0006 |
| gpt-3.5-turbo | $0.0005 | $0.0015 |
## A/B Testing Framework

Test different prompt variations to optimize response quality.

### Creating an Experiment

```typescript
import { ABTestingEngine, getABTestingEngine } from "@hitler/prompts";

const engine = getABTestingEngine();

// Create the experiment
const experiment = engine.createExperiment({
  id: "exp-intent-detection-v2",
  name: "Intent Detection Prompt V2",
  description: "Testing more detailed examples in intent detection",
  promptComponent: "intent",
  minSampleSize: 100,
  metrics: ["latency", "positive_feedback", "intent_accuracy"],
  variants: [
    {
      id: "control",
      name: "Current Prompt",
      content: currentIntentPrompt,
      trafficPercent: 50,
      active: true,
    },
    {
      id: "variant-a",
      name: "More Examples",
      content: newIntentPromptWithMoreExamples,
      trafficPercent: 50,
      active: true,
    },
  ],
});

// Start the experiment
engine.startExperiment("exp-intent-detection-v2");
```
### Getting a Variant for Users

```typescript
// Consistent assignment: the same user always gets the same variant
const variant = engine.getVariantForUser(experimentId, userId);
if (variant) {
  // Use variant.content as the prompt
  const prompt = variant.content;
}

// Or use the helper method
const { prompt, experimentId, variantId } = engine.getPromptWithExperiment(
  basePrompt,
  "intent",
  userId
);
```
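Consistent assignment is typically implemented by hashing the user ID into a fixed bucket range and mapping buckets onto the variants' traffic percentages. A self-contained sketch of one common approach (FNV-1a hashing is an assumption here; the library may hash differently):

```typescript
interface Variant {
  id: string;
  trafficPercent: number;
  active: boolean;
}

// FNV-1a 32-bit hash of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a user deterministically into [0, 100), then walk the cumulative traffic split.
function assignVariant(experimentId: string, userId: string, variants: Variant[]): Variant | null {
  const bucket = fnv1a(`${experimentId}:${userId}`) % 100;
  let cumulative = 0;
  for (const v of variants) {
    if (!v.active) continue;
    cumulative += v.trafficPercent;
    if (bucket < cumulative) return v;
  }
  return null; // traffic percentages did not cover this bucket
}
```

Because the bucket depends only on `experimentId` and `userId`, repeat calls for the same user are stable, which keeps each user's experience consistent for the life of the experiment.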
### Recording Results

```typescript
// Record a result after the LLM call
engine.recordResult({
  experimentId: "exp-intent-detection-v2",
  variantId: "variant-a",
  userId: "user-123",
  sessionId: "session-456",
  latencyMs: 450,
  metrics: {
    intent_confidence: 0.95,
    response_quality: 8,
  },
  timestamp: new Date(),
});

// Record user feedback later
engine.recordFeedback(
  experimentId,
  variantId,
  sessionId,
  "positive" // or "negative"
);
```
### Analyzing Results

```typescript
const stats = engine.getExperimentStats("exp-intent-detection-v2");

// Returns an array of VariantStats:
[
  {
    variantId: "control",
    variantName: "Current Prompt",
    sampleSize: 500,
    avgLatencyMs: 480,
    p95LatencyMs: 850,
    positiveFeedbackRate: 0.72,
    negativeFeedbackRate: 0.08,
    metricAverages: {
      intent_confidence: 0.88,
      response_quality: 7.2,
    },
  },
  {
    variantId: "variant-a",
    variantName: "More Examples",
    sampleSize: 500,
    avgLatencyMs: 520,
    p95LatencyMs: 920,
    positiveFeedbackRate: 0.81,
    negativeFeedbackRate: 0.05,
    metricAverages: {
      intent_confidence: 0.92,
      response_quality: 7.8,
    },
    significanceVsControl: {
      metric: "positive_feedback_rate",
      pValue: 0.02,
      significant: true, // p < 0.05
    },
  },
];
```
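A comparison like `significanceVsControl` on feedback rates can be approximated with a two-proportion z-test. The sketch below is self-contained and illustrative; the library's actual statistical method is not documented here:

```typescript
// Standard normal CDF via the Abramowitz-Stegun tail approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const tail =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - tail : tail;
}

// Two-sided two-proportion z-test: x successes out of n trials per group.
function twoProportionPValue(x1: number, n1: number, x2: number, n2: number): number {
  const p1 = x1 / n1;
  const p2 = x2 / n2;
  const pooled = (x1 + x2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  const z = Math.abs(p1 - p2) / se;
  return 2 * (1 - normalCdf(z));
}

// Variant A: 405/500 positive feedback vs control: 360/500.
const pValue = twoProportionPValue(405, 500, 360, 500);
const significant = pValue < 0.05;
```

With 500 samples per side, the 0.81 vs 0.72 split in the example above is comfortably significant; with much smaller samples the same gap would not be, which is why `minSampleSize` matters.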
### Experiment Lifecycle

```typescript
// Create in draft status
engine.createExperiment({ ... });

// Start running
engine.startExperiment(experimentId);

// Pause if needed
engine.pauseExperiment(experimentId);

// Complete when done
engine.completeExperiment(experimentId);
```
## Integration Example

Complete integration in a chat service:

```typescript
import {
  getLLMTracer,
  getABTestingEngine,
  getSecurityMonitor,
  assessThreatLevel,
  validateOutput,
  withTracing,
} from "@hitler/prompts";

class ChatService {
  private tracer = getLLMTracer();
  private abEngine = getABTestingEngine();
  private monitor = getSecurityMonitor();

  async chat(message: string, context: ChatContext) {
    // 1. Security check
    const threat = assessThreatLevel(message);
    this.monitor.recordEvent({
      organizationId: context.orgId,
      userId: context.userId,
      flagged: threat.shouldFlag,
      blocked: threat.shouldBlock,
      threats: threat.details.threats,
    });
    if (threat.shouldBlock) {
      return { text: "hey! need help with tasks?" };
    }

    // 2. Get prompt (with A/B testing)
    const { prompt, experimentId, variantId } = this.abEngine.getPromptWithExperiment(
      this.baseSystemPrompt,
      "system",
      context.userId
    );

    // 3. Make LLM call with tracing
    const startTime = Date.now();
    const result = await withTracing(
      {
        model: "claude-3-5-sonnet-20241022",
        provider: "anthropic",
        callType: "chat",
        userId: context.userId,
        organizationId: context.orgId,
        experimentVariant: variantId,
      },
      () => this.callLLM(prompt, threat.details.sanitized)
    );

    // 4. Record A/B test result
    if (experimentId && variantId) {
      this.abEngine.recordResult({
        experimentId,
        variantId,
        userId: context.userId,
        sessionId: context.sessionId,
        latencyMs: Date.now() - startTime,
        metrics: { response_length: result.length },
        timestamp: new Date(),
      });
    }

    // 5. Validate output
    const validation = validateOutput(result, context.sessionId);
    return { text: validation.safe ? result : validation.sanitized };
  }
}
```
## Dashboard Queries

### Cost by Organization

```sql
SELECT
  organization_id,
  SUM(estimated_cost_usd) AS total_cost,
  COUNT(*) AS total_calls,
  AVG(duration_ms) AS avg_latency
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY organization_id
ORDER BY total_cost DESC;
```
### Performance by Model

```sql
SELECT
  model,
  COUNT(*) AS calls,
  AVG(duration_ms) AS avg_latency,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_latency,
  SUM(input_tokens + output_tokens) AS total_tokens,
  AVG(CASE WHEN success THEN 1 ELSE 0 END) AS success_rate
FROM llm_traces
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY model;
```
### A/B Test Winner Detection

```sql
WITH variant_stats AS (
  SELECT
    variant_id,
    COUNT(*) AS sample_size,
    AVG(CASE WHEN feedback = 'positive' THEN 1 ELSE 0 END) AS positive_rate,
    AVG(latency_ms) AS avg_latency
  FROM experiment_results
  WHERE experiment_id = 'exp-123'
  GROUP BY variant_id
)
SELECT
  *,
  CASE
    WHEN positive_rate > (SELECT positive_rate FROM variant_stats WHERE variant_id = 'control')
      THEN 'WINNER'
    ELSE 'LOSING'
  END AS status
FROM variant_stats;
```
## Best Practices

- **Always trace LLM calls**: Use the `withTracing` wrapper for all LLM interactions.
- **Set meaningful prompt versions**: Track which prompt version produced each response.
- **Run experiments with sufficient sample size**: At least 100 samples per variant for statistical significance.
- **Monitor costs regularly**: Set up alerts for unexpected cost increases.
- **Archive old traces**: Move traces older than 90 days to cold storage.
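The archival practice can be as simple as partitioning stored traces on a cutoff date before moving the old batch to cold storage. A sketch of that step (the storage calls are omitted, and the `createdAt` field name is an assumption about the persisted schema):

```typescript
interface StoredTrace {
  traceId: string;
  createdAt: Date;
}

const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

// Split traces into those to keep hot and those to move to cold storage.
function partitionForArchive(traces: StoredTrace[], now: Date = new Date()) {
  const cutoff = now.getTime() - NINETY_DAYS_MS;
  const hot: StoredTrace[] = [];
  const cold: StoredTrace[] = [];
  for (const t of traces) {
    (t.createdAt.getTime() >= cutoff ? hot : cold).push(t);
  }
  return { hot, cold };
}
```

A scheduled job would run this partition, bulk-write `cold` to the archive, and delete those rows from the hot store only after the write succeeds.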
## Notification Deduplication

The `TaskNotificationDedupService` (`apps/api/src/modules/jobs/task-notification-dedup.service.ts`) prevents duplicate notifications from being sent for the same event. It uses Redis-based dedup keys to ensure that repeated events (e.g., multiple overdue checks for the same task) generate only one notification within a configurable window. This is critical for cron-driven jobs that run frequently and may process the same tasks multiple times.
**Privacy note:** LLM traces may contain sensitive user input. Ensure proper access controls and consider truncating input in traces for privacy compliance.