ADR-005: KL Divergence Threshold Calibration Method

Status: Accepted
Date: 2025-12-26
Issue: #1 - GAP-DriftThreshold
Deciders: Risk Team (Analytics), Data Science
Related Gap: GAP-H1 (Drift Threshold Calibration)


Context

The Guardrail Framework uses KL divergence to detect distribution drift between the frozen baseline (training data distribution) and live production data. The current specification defines a provisional threshold τ = 0.5, but this value requires calibration based on observed data variability.
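The actual computation lives in src/engine/drift.py; a minimal sketch of the idea (the function name and epsilon smoothing here are illustrative, not the framework's API) is:

```python
import numpy as np

def kl_divergence(baseline_probs, live_probs, eps=1e-10):
    """KL(P_baseline || Q_live) over a shared set of histogram bins.

    eps smooths zero-probability bins, which would otherwise make the
    divergence infinite.
    """
    p = np.asarray(baseline_probs, dtype=float) + eps
    q = np.asarray(live_probs, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Identical distributions give (near-)zero divergence:
baseline = [0.25, 0.25, 0.25, 0.25]
assert kl_divergence(baseline, baseline) < 1e-6

# A shifted live distribution yields a positive divergence:
drifted = [0.10, 0.20, 0.30, 0.40]
kl = kl_divergence(baseline, drifted)
```

Each live KL value like `kl` above is then compared against τ; calibrating that τ is the subject of this ADR.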

The Problem:

  • A threshold that is too sensitive (low τ) causes alert fatigue from false positives
  • A threshold that is too lenient (high τ) misses actual drift until model performance degrades
  • The provisional τ = 0.5 was chosen as a balanced starting point but lacks empirical grounding

Constraints:

  • Threshold must be calibrated before production launch (β-to-Prod milestone)
  • Calibration requires understanding of normal data variability
  • Must support multi-tier alerting (warning vs. critical)


Decision Drivers

  1. Minimize false positives - Alert fatigue undermines trust in the system
  2. Detect actual drift promptly - Before model performance degrades
  3. Use statistically grounded methodology - Reproducible and auditable
  4. Support tiered response - Different severity levels for different drift magnitudes
  5. Enable recalibration - Threshold should be adjustable as baseline changes

Considered Options

Option 1: Static Fixed Threshold

Description: Keep τ = 0.5 as a fixed constant.

Pros:
  • Simple to implement
  • Deterministic behavior
  • Easy to explain

Cons:
  • No adaptation to actual variability
  • May not fit this use case
  • Could cause high false positive rate

Verdict: Rejected - Does not account for actual data characteristics.

Option 2: Mean + k×σ Formula

Description: Set τ = μ + k×σ where μ and σ are computed from observed KL values, and k controls false positive rate.

Pros:
  • Statistically grounded
  • Adjustable via k parameter
  • Well-established method

Cons:
  • Assumes normal distribution
  • Sensitive to outliers
  • Requires sufficient data

Verdict: Considered - Good baseline method but may not handle heavy-tailed distributions.
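For comparison with the selected option, a sketch of this formula (function name and the k = 3 default are illustrative; under normality, k = 3 targets roughly a 0.1% false positive rate):

```python
import numpy as np

def mean_sigma_threshold(observed_kl, k=3.0):
    """tau = mu + k*sigma computed from observed KL values."""
    values = np.asarray(observed_kl, dtype=float)
    return float(values.mean() + k * values.std(ddof=1))

# Hypothetical 30-day sample of shadow-mode KL observations
rng = np.random.default_rng(42)
observed = rng.normal(loc=0.20, scale=0.05, size=30)
tau = mean_sigma_threshold(observed, k=3.0)
```

The weakness noted above is visible here: a single outlier inflates both μ and σ, shifting τ, whereas percentiles are largely unaffected.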

Option 3: Percentile-Based Calibration (Selected)

Description: Set thresholds based on percentiles of observed KL values:

  • τ_warn = 90th percentile (warning level)
  • τ_crit = 99th percentile (critical level)

Pros:
  • Distribution-agnostic
  • Naturally handles outliers
  • Intuitive interpretation
  • Supports multi-tier alerting

Cons:
  • Requires representative sample
  • Percentile choice is somewhat arbitrary
  • Must recalibrate when baseline changes

Verdict: Selected - Best balance of robustness and interpretability.

Option 4: Dynamic Adaptive Threshold

Description: Continuously adjust threshold based on model performance feedback loop.

Pros:
  • Optimal performance
  • Self-correcting
  • Adapts to changing conditions

Cons:
  • Complex implementation
  • Requires a reliable correlation between drift and model performance
  • Harder to audit and explain

Verdict: Deferred - Recommended for future enhancement after baseline approach proven.


Decision

We will use percentile-based calibration with two-tier thresholds:

τ_warn = P90(observed KL values)  → Warning threshold
τ_crit = P99(observed KL values)  → Critical threshold
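A minimal sketch of the calibration step (the function name is illustrative; the actual script is described in the implementation plan below):

```python
import numpy as np

def calibrate_thresholds(observed_kl):
    """Percentile-based two-tier calibration: P90 warning, P99 critical."""
    values = np.asarray(observed_kl, dtype=float)
    tau_warn = float(np.percentile(values, 90))
    tau_crit = float(np.percentile(values, 99))
    return tau_warn, tau_crit

# Heavy-tailed stand-in for observed KL values: percentiles stay robust
# here, where mu + k*sigma would be inflated by the tail.
rng = np.random.default_rng(7)
sample = rng.lognormal(mean=-1.5, sigma=0.4, size=1000)
tau_warn, tau_crit = calibrate_thresholds(sample)
```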

Threshold Tiers

  • Normal (KL < τ_warn): No action, continue monitoring
  • Warning (τ_warn ≤ KL < τ_crit): Log event, highlight in dashboard
  • Critical (KL ≥ τ_crit): Alert stakeholders, evaluate kill-switch
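The tier table maps directly onto a small classification function; this sketch uses an illustrative name, not the framework's API:

```python
def classify_drift(kl_value, tau_warn, tau_crit):
    """Map an observed KL value onto the tier table above."""
    if kl_value >= tau_crit:
        return "critical"   # alert stakeholders, evaluate kill-switch
    if kl_value >= tau_warn:
        return "warning"    # log event, highlight in dashboard
    return "normal"         # no action, continue monitoring

# Boundary behavior matches the table: tiers are half-open intervals.
assert classify_drift(0.10, tau_warn=0.30, tau_crit=0.55) == "normal"
assert classify_drift(0.40, tau_warn=0.30, tau_crit=0.55) == "warning"
assert classify_drift(0.60, tau_warn=0.30, tau_crit=0.55) == "critical"
```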

Calibration Procedure

  1. Data Collection: Collect 30+ days of KL divergence values from shadow scoring
  2. Validation: Ensure <5% missing values, no systematic gaps
  3. Computation: Calculate P90 and P99 percentiles
  4. Back-testing: Verify alert rate is acceptable (~10% warning, ~1% critical)
  5. Documentation: Record rationale and statistical summary
  6. Deployment: Update parameter freeze with new values
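Steps 2 and 4 of the procedure can be sketched as follows (function name, the gamma stand-in data, and the 5% missing-value cutoff as a parameter are illustrative assumptions, not the calibration script itself):

```python
import numpy as np

def backtest_alert_rates(observed_kl, tau_warn, tau_crit, max_missing_rate=0.05):
    """Validate completeness (<5% missing), then check that observed alert
    rates roughly match the targeted ~10% warning / ~1% critical."""
    values = np.asarray(observed_kl, dtype=float)
    missing_rate = float(np.isnan(values).mean())
    if missing_rate >= max_missing_rate:
        raise ValueError(f"too many missing values: {missing_rate:.1%}")
    clean = values[~np.isnan(values)]
    return {
        "warning_rate": float((clean >= tau_warn).mean()),
        "critical_rate": float((clean >= tau_crit).mean()),
    }

# Stand-in KL series; in practice this is the shadow-scoring sample.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=4.0, scale=0.05, size=3000)
tau_warn, tau_crit = np.percentile(sample, 90), np.percentile(sample, 99)
rates = backtest_alert_rates(sample, tau_warn, tau_crit)
```

By construction, back-testing on the same sample used for calibration returns rates near 10% and 1%; the informative check is running it on a held-out window.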

Expected Outcomes

Based on industry research (Evidently AI, Arthur.ai, Deepchecks):

  • Stable data pipeline: τ_warn 0.25 - 0.35, τ_crit 0.40 - 0.55
  • Moderate variability: τ_warn 0.30 - 0.40, τ_crit 0.50 - 0.65
  • High variability: τ_warn 0.35 - 0.50, τ_crit 0.60 - 0.80

Consequences

Positive

  • Reduced false positives: Thresholds adapt to actual data variability
  • Clear escalation path: Two-tier system prevents alert fatigue
  • Auditable: Percentile-based method is transparent and reproducible
  • Actionable alerts: Critical alerts are rare enough to demand attention

Negative

  • Requires recalibration: When baseline distribution changes, thresholds must be recomputed
  • Data dependency: Need 30+ days of representative data before production calibration
  • Percentile choice: 90th/99th percentiles are reasonable defaults but may need adjustment

Risks

  • Insufficient shadow data: Extend collection period; use synthetic data for methodology validation
  • Non-representative sample: Validate that the sample covers expected data patterns
  • Baseline changes frequently: Implement automated recalibration pipeline

Validation

Pre-Deployment Validation

  • [ ] Calibration script produces correct percentiles on test data
  • [ ] Back-test shows acceptable alert rates
  • [ ] τ_crit validated against provisional τ = 0.5
  • [ ] Statistical rationale documented
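The first checklist item can be exercised with a hand-computable test vector; this is a sketch of such a check, not the project's actual test suite (the expected values follow from 101 evenly spaced points and linear percentile interpolation):

```python
import numpy as np

# Known test vector: 101 evenly spaced values from 0.10 to 0.30 (step 0.002).
test_values = np.linspace(0.10, 0.30, 101)
tau_warn = float(np.percentile(test_values, 90))
tau_crit = float(np.percentile(test_values, 99))

assert abs(tau_warn - 0.280) < 1e-6   # 90th percentile: 0.10 + 90 * 0.002
assert abs(tau_crit - 0.298) < 1e-6   # 99th percentile: 0.10 + 99 * 0.002
assert tau_crit < 0.5                 # sanity check against provisional tau
```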

Post-Deployment Validation

  • [ ] False positive rate < 5% after 7 days
  • [ ] No missed drift events (zero false negatives on known issues)
  • [ ] Stakeholder feedback on alert actionability

Implementation

  • Source: src/engine/drift.py - KL divergence computation
  • Shadow mode: src/integration/pcw_decide.py - shadow_mode=True collects KL divergence observations from live proposals without enforcing decisions (Phase 1 complete)
  • Tests: tests/integration/test_shadow_mode.py (44 tests), tests/test_engine.py (drift tests)
  • Metrics: src/telemetry/prometheus_exporter.py - drift metrics, shadow evaluation counter

  • Implementation Plan: /docs/implementation-plans/001-drift-threshold-calibration.md
  • Interface Contract: schema/interface-contract.yaml §frozen_parameters
  • Specification: spec/guardrails/Hardened_Quantitative_Guardrail_Framework_Specification.md §2.5

Changelog

  • 1.0.0 (2025-12-26, Claude Code): Initial ADR
  • 1.0.1 (2026-01-31, Claude Code): Migrated to consolidated ADR directory, updated references
  • 1.1.0 (2026-02-09, Claude Code): Shadow mode (Phase 1) implemented; pcw_decide(shadow_mode=True) collects KL divergence observations; 44 integration tests