# ADR-005: KL Divergence Threshold Calibration Method

**Status:** Accepted
**Date:** 2025-12-26
**Issue:** #1 - GAP-DriftThreshold
**Deciders:** Risk Team (Analytics), Data Science
**Related Gap:** GAP-H1 (Drift Threshold Calibration)
## Context
The Guardrail Framework uses KL divergence to detect distribution drift between the frozen baseline (training data distribution) and live production data. The current specification defines a provisional threshold τ = 0.5, but this value requires calibration based on observed data variability.
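As a minimal sketch of the detection mechanism (the framework's actual implementation lives in `src/engine/drift.py` and may differ), drift can be measured by binning baseline and live data into histograms and computing KL divergence between the normalized distributions; the `eps` smoothing constant is an assumption here, used to keep sparse bins finite:

```python
import numpy as np

def kl_divergence(baseline: np.ndarray, live: np.ndarray, eps: float = 1e-10) -> float:
    """KL(baseline || live) between two binned count distributions.

    Both inputs are histogram counts over the same bins; they are
    normalized to probabilities before the divergence is computed.
    """
    p = baseline / baseline.sum()
    q = live / live.sum()
    # Smooth away zero bins so log(p / q) stays finite.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))
```

Identical distributions yield a divergence of 0; the value grows as the live distribution departs from the baseline, which is what the threshold τ gates on.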
The Problem:

- A threshold that is too sensitive (low τ) causes alert fatigue from false positives
- A threshold that is too lenient (high τ) misses actual drift until model performance degrades
- The provisional τ = 0.5 was chosen as a balanced starting point but lacks empirical grounding
Constraints:

- Threshold must be calibrated before production launch (β-to-Prod milestone)
- Calibration requires understanding of normal data variability
- Must support multi-tier alerting (warning vs. critical)
## Decision Drivers
- Minimize false positives - Alert fatigue undermines trust in the system
- Detect actual drift promptly - Before model performance degrades
- Use statistically grounded methodology - Reproducible and auditable
- Support tiered response - Different severity levels for different drift magnitudes
- Enable recalibration - Threshold should be adjustable as baseline changes
## Considered Options

### Option 1: Static Fixed Threshold
Description: Keep τ = 0.5 as a fixed constant.
| Pros | Cons |
|---|---|
| Simple to implement | No adaptation to actual variability |
| Deterministic behavior | May not match actual data variability |
| Easy to explain | Could cause high false positive rate |
Verdict: Rejected - Does not account for actual data characteristics.
### Option 2: Mean + k×σ Formula
Description: Set τ = μ + k×σ where μ and σ are computed from observed KL values, and k controls false positive rate.
| Pros | Cons |
|---|---|
| Statistically grounded | Assumes normal distribution |
| Adjustable via k parameter | Sensitive to outliers |
| Well-established method | Requires sufficient data |
Verdict: Considered - Good baseline method but may not handle heavy-tailed distributions.
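For comparison, the mean + k×σ rule from this option is a one-liner; the function name and the default of k = 3 (which targets roughly a 0.1% false positive rate under a normality assumption) are illustrative, not part of the framework:

```python
import numpy as np

def mean_sigma_threshold(kl_values, k: float = 3.0) -> float:
    """tau = mu + k * sigma over observed KL values (sample std, ddof=1)."""
    kl = np.asarray(kl_values, dtype=float)
    return float(kl.mean() + k * kl.std(ddof=1))
```

Its weakness, as noted in the cons, is that a few outlier KL values inflate σ and push τ upward, which is one reason the percentile approach below was preferred.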
### Option 3: Percentile-Based Calibration (Selected)
Description: Set thresholds based on percentiles of observed KL values:

- τ_warn = 90th percentile (warning level)
- τ_crit = 99th percentile (critical level)
| Pros | Cons |
|---|---|
| Distribution-agnostic | Requires representative sample |
| Naturally handles outliers | Percentile choice is somewhat arbitrary |
| Intuitive interpretation | Must recalibrate when baseline changes |
| Supports multi-tier alerting | - |
Verdict: Selected - Best balance of robustness and interpretability.
### Option 4: Dynamic Adaptive Threshold
Description: Continuously adjust threshold based on model performance feedback loop.
| Pros | Cons |
|---|---|
| Optimal performance | Complex implementation |
| Self-correcting | Requires performance correlation |
| Adapts to changing conditions | Harder to audit and explain |
Verdict: Deferred - Recommended for future enhancement after baseline approach proven.
## Decision
We will use percentile-based calibration with two-tier thresholds:
- τ_warn = P90(observed KL values) → Warning threshold
- τ_crit = P99(observed KL values) → Critical threshold
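The selected calibration reduces to two percentile computations over the observed KL series; `calibrate_thresholds` is a hypothetical helper name for illustration, not the project's actual API:

```python
import numpy as np

def calibrate_thresholds(kl_values, warn_pct: float = 90.0, crit_pct: float = 99.0):
    """Derive the two-tier thresholds from observed KL divergence values."""
    kl = np.asarray(kl_values, dtype=float)
    tau_warn = float(np.percentile(kl, warn_pct))  # P90 -> warning
    tau_crit = float(np.percentile(kl, crit_pct))  # P99 -> critical
    return tau_warn, tau_crit
```

Because percentiles are order statistics, a handful of extreme KL spikes shifts the thresholds far less than it would shift a mean + k×σ estimate, which is the robustness property the decision relies on.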
### Threshold Tiers
| Tier | Condition | Action |
|---|---|---|
| Normal | KL < τ_warn | No action, continue monitoring |
| Warning | τ_warn ≤ KL < τ_crit | Log event, highlight in dashboard |
| Critical | KL ≥ τ_crit | Alert stakeholders, evaluate kill-switch |
### Calibration Procedure

1. Data Collection: Collect 30+ days of KL divergence values from shadow scoring
2. Validation: Ensure <5% missing values, no systematic gaps
3. Computation: Calculate P90 and P99 percentiles
4. Back-testing: Verify alert rate is acceptable (~10% warning, ~1% critical)
5. Documentation: Record rationale and statistical summary
6. Deployment: Update parameter freeze with new values
## Expected Outcomes
Based on industry research (Evidently AI, Arthur.ai, Deepchecks):
| Scenario | Expected τ_warn | Expected τ_crit |
|---|---|---|
| Stable data pipeline | 0.25 - 0.35 | 0.40 - 0.55 |
| Moderate variability | 0.30 - 0.40 | 0.50 - 0.65 |
| High variability | 0.35 - 0.50 | 0.60 - 0.80 |
## Consequences

### Positive
- Reduced false positives: Thresholds adapt to actual data variability
- Clear escalation path: Two-tier system prevents alert fatigue
- Auditable: Percentile-based method is transparent and reproducible
- Actionable alerts: Critical alerts are rare enough to demand attention
### Negative
- Requires recalibration: When baseline distribution changes, thresholds must be recomputed
- Data dependency: Need 30+ days of representative data before production calibration
- Percentile choice: 90th/99th percentiles are reasonable defaults but may need adjustment
### Risks
| Risk | Mitigation |
|---|---|
| Insufficient shadow data | Extend collection period; use synthetic data for methodology validation |
| Non-representative sample | Validate sample covers expected data patterns |
| Baseline changes frequently | Implement automated recalibration pipeline |
## Validation

### Pre-Deployment Validation
- [ ] Calibration script produces correct percentiles on test data
- [ ] Back-test shows acceptable alert rates
- [ ] τ_crit validated against provisional τ = 0.5
- [ ] Statistical rationale documented
### Post-Deployment Validation
- [ ] False positive rate < 5% after 7 days
- [ ] No missed drift events (zero false negatives on known issues)
- [ ] Stakeholder feedback on alert actionability
## Implementation

- Source: `src/engine/drift.py` - KL divergence computation
- Shadow mode: `src/integration/pcw_decide.py` - `shadow_mode=True` collects KL divergence observations from live proposals without enforcing decisions (Phase 1 complete)
- Tests: `tests/integration/test_shadow_mode.py` (44 tests), `tests/test_engine.py` (drift tests)
- Metrics: `src/telemetry/prometheus_exporter.py` - drift metrics, shadow evaluation counter
## Related Documents

- Implementation Plan: `/docs/implementation-plans/001-drift-threshold-calibration.md`
- Interface Contract: `schema/interface-contract.yaml` §frozen_parameters
- Specification: `spec/guardrails/Hardened_Quantitative_Guardrail_Framework_Specification.md` §2.5
## Changelog

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2025-12-26 | Claude Code | Initial ADR |
| 1.0.1 | 2026-01-31 | Claude Code | Migrated to consolidated ADR directory, updated references |
| 1.1.0 | 2026-02-09 | Claude Code | Shadow mode (Phase 1) implemented: pcw_decide(shadow_mode=True) collects KL divergence observations; 44 integration tests |