# ADR-005: KL Divergence Threshold Calibration Method

**Status:** Accepted
**Date:** 2025-12-26
**Issue:** #1 - GAP-DriftThreshold
**Deciders:** Risk Team (Analytics), Data Science
**Related Gap:** GAP-H1 (Drift Threshold Calibration)
## Context
The Guardrail Framework uses KL divergence to detect distribution drift between the frozen baseline (training data distribution) and live production data. The current specification defines a provisional threshold τ = 0.5, but this value requires calibration based on observed data variability.
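As a minimal sketch of the detection mechanism (the framework's actual implementation lives in `src/engine/drift.py` and may differ), drift can be measured by binning baseline and live data into histograms and computing KL divergence between the normalized distributions; the `eps` smoothing constant is an assumption here, used to keep sparse bins finite:

```python
import numpy as np

def kl_divergence(baseline: np.ndarray, live: np.ndarray, eps: float = 1e-10) -> float:
    """KL(baseline || live) between two binned count distributions.

    Both inputs are histogram counts over the same bins; they are
    normalized to probabilities before the divergence is computed.
    """
    p = baseline / baseline.sum()
    q = live / live.sum()
    # Smooth away zero bins so log(p / q) stays finite.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))
```

Identical distributions yield a divergence of 0; the value grows as the live distribution departs from the baseline, which is what the threshold τ gates on.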
The Problem:

- A threshold that is too sensitive (low τ) causes alert fatigue from false positives
- A threshold that is too lenient (high τ) misses actual drift until model performance degrades
- The provisional τ = 0.5 was chosen as a balanced starting point but lacks empirical grounding
Constraints:

- Threshold must be calibrated before production launch (β-to-Prod milestone)
- Calibration requires understanding of normal data variability
- Must support multi-tier alerting (warning vs. critical)
## Decision Drivers
- Minimize false positives - Alert fatigue undermines trust in the system
- Detect actual drift promptly - Before model performance degrades
- Use statistically grounded methodology - Reproducible and auditable
- Support tiered response - Different severity levels for different drift magnitudes
- Enable recalibration - Threshold should be adjustable as baseline changes
## Considered Options

### Option 1: Static Fixed Threshold
Description: Keep τ = 0.5 as a fixed constant.
| Pros | Cons |
|---|---|
| Simple to implement | No adaptation to actual variability |
| Deterministic behavior | May not match actual data variability |
| Easy to explain | Could cause high false positive rate |
Verdict: Rejected - Does not account for actual data characteristics.
### Option 2: Mean + k×σ Formula
Description: Set τ = μ + k×σ where μ and σ are computed from observed KL values, and k controls false positive rate.
| Pros | Cons |
|---|---|
| Statistically grounded | Assumes normal distribution |
| Adjustable via k parameter | Sensitive to outliers |
| Well-established method | Requires sufficient data |
Verdict: Considered - Good baseline method but may not handle heavy-tailed distributions.
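For comparison, the mean + k×σ rule from this option is a one-liner; the function name and the default of k = 3 (which targets roughly a 0.1% false positive rate under a normality assumption) are illustrative, not part of the framework:

```python
import numpy as np

def mean_sigma_threshold(kl_values, k: float = 3.0) -> float:
    """tau = mu + k * sigma over observed KL values (sample std, ddof=1)."""
    kl = np.asarray(kl_values, dtype=float)
    return float(kl.mean() + k * kl.std(ddof=1))
```

Its weakness, as noted in the cons, is that a few outlier KL values inflate σ and push τ upward, which is one reason the percentile approach below was preferred.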
### Option 3: Percentile-Based Calibration (Selected)
Description: Set thresholds based on percentiles of observed KL values:

- τ_warn = 90th percentile (warning level)
- τ_crit = 99th percentile (critical level)
| Pros | Cons |
|---|---|
| Distribution-agnostic | Requires representative sample |
| Naturally handles outliers | Percentile choice is somewhat arbitrary |
| Intuitive interpretation | Must recalibrate when baseline changes |
| Supports multi-tier alerting | - |
Verdict: Selected - Best balance of robustness and interpretability.
### Option 4: Dynamic Adaptive Threshold
Description: Continuously adjust threshold based on model performance feedback loop.
| Pros | Cons |
|---|---|
| Optimal performance | Complex implementation |
| Self-correcting | Requires performance correlation |
| Adapts to changing conditions | Harder to audit and explain |
Verdict: Deferred - Recommended for future enhancement after baseline approach proven.
## Decision
We will use percentile-based calibration with two-tier thresholds:
- τ_warn = P90(observed KL values) → Warning threshold
- τ_crit = P99(observed KL values) → Critical threshold
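The selected calibration reduces to two percentile computations over the observed KL series; `calibrate_thresholds` is a hypothetical helper name for illustration, not the project's actual API:

```python
import numpy as np

def calibrate_thresholds(kl_values, warn_pct: float = 90.0, crit_pct: float = 99.0):
    """Derive the two-tier thresholds from observed KL divergence values."""
    kl = np.asarray(kl_values, dtype=float)
    tau_warn = float(np.percentile(kl, warn_pct))  # P90 -> warning
    tau_crit = float(np.percentile(kl, crit_pct))  # P99 -> critical
    return tau_warn, tau_crit
```

Because percentiles are order statistics, a handful of extreme KL spikes shifts the thresholds far less than it would shift a mean + k×σ estimate, which is the robustness property the decision relies on.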
### Threshold Tiers
| Tier | Condition | Action |
|---|---|---|
| Normal | KL < τ_warn | No action, continue monitoring |
| Warning | τ_warn ≤ KL < τ_crit | Log event, highlight in dashboard |
| Critical | KL ≥ τ_crit | Alert stakeholders, evaluate kill-switch |
### Calibration Procedure

1. Data Collection: Collect 30+ days of KL divergence values from shadow scoring
2. Validation: Ensure <5% missing values, no systematic gaps
3. Computation: Calculate P90 and P99 percentiles
4. Back-testing: Verify alert rate is acceptable (~10% warning, ~1% critical)
5. Documentation: Record rationale and statistical summary
6. Deployment: Update parameter freeze with new values
## Expected Outcomes
Based on industry research (Evidently AI, Arthur.ai, Deepchecks):
| Scenario | Expected τ_warn | Expected τ_crit |
|---|---|---|
| Stable data pipeline | 0.25 - 0.35 | 0.40 - 0.55 |
| Moderate variability | 0.30 - 0.40 | 0.50 - 0.65 |
| High variability | 0.35 - 0.50 | 0.60 - 0.80 |
## Consequences

### Positive
- Reduced false positives: Thresholds adapt to actual data variability
- Clear escalation path: Two-tier system prevents alert fatigue
- Auditable: Percentile-based method is transparent and reproducible
- Actionable alerts: Critical alerts are rare enough to demand attention
### Negative
- Requires recalibration: When baseline distribution changes, thresholds must be recomputed
- Data dependency: Need 30+ days of representative data before production calibration
- Percentile choice: 90th/99th percentiles are reasonable defaults but may need adjustment
### Risks
| Risk | Mitigation |
|---|---|
| Insufficient shadow data | Extend collection period; use synthetic data for methodology validation |
| Non-representative sample | Validate sample covers expected data patterns |
| Baseline changes frequently | Implement automated recalibration pipeline |
## Validation

### Pre-Deployment Validation
- [ ] Calibration script produces correct percentiles on test data
- [ ] Back-test shows acceptable alert rates
- [ ] τ_crit validated against provisional τ = 0.5
- [ ] Statistical rationale documented
### Post-Deployment Validation
- [ ] False positive rate < 5% after 7 days
- [ ] No missed drift events (zero false negatives on known issues)
- [ ] Stakeholder feedback on alert actionability
## Implementation

- Source: `src/engine/drift.py` - KL divergence computation
- Shadow mode: `src/integration/pcw_decide.py` - `shadow_mode=True` collects KL divergence observations from live proposals without enforcing decisions (Phase 1 complete)
- Tests: `tests/integration/test_shadow_mode.py` (44 tests), `tests/test_engine.py` (drift tests)
- Metrics: `src/telemetry/prometheus_exporter.py` - drift metrics, shadow evaluation counter
## Related Documents

- Implementation Plan: `/docs/implementation-plans/001-drift-threshold-calibration.md`
- Interface Contract: `schema/interface-contract.yaml` §frozen_parameters
- Specification: `spec/guardrails/Hardened_Quantitative_Guardrail_Framework_Specification.md` §2.5
## Changelog

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2025-12-26 | Claude Code | Initial ADR |
| 1.0.1 | 2026-01-31 | Claude Code | Migrated to consolidated ADR directory, updated references |
| 1.1.0 | 2026-02-09 | Claude Code | Shadow mode (Phase 1) implemented: pcw_decide(shadow_mode=True) collects KL divergence observations; 44 integration tests |