
AEGIS Mathematical Coherence Review

Version: 1.2.0
Date: 2026-01-31
Protocol: Rigor v1.0 + Claude-GPT Dialogical Collaboration
Mode: ideation-critique
GPT Reasoning: xhigh


Executive Summary

This document captures a multi-model validation of AEGIS's mathematical foundations: verifying statistical methods, formula coherence, and integration soundness, and identifying potential gaps or failure modes.


Round 1: Claude Opus 4.5 Initial Analysis

1. Bayesian Risk/Profit Gates

1.1 Implementation Review

The conjugate normal-normal model is correctly implemented:

posterior_precision = 1/σ₀² + 1/σ_L²
posterior_variance = 1/posterior_precision
μ_post = posterior_variance × (μ/σ₀² + observed/σ_L²)
P(Δ ≥ threshold) = 1 - Φ((threshold - μ_post) / σ_post)

Mathematical Verification:

- The precision-weighted update formula is the standard conjugate-prior result
- For Normal-Normal with known variance, this is textbook-correct
- The CDF computation via 1 - Φ(z) correctly gives the upper tail probability
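For concreteness, a minimal sketch of this update and tail computation (the function name and signature are illustrative, not the AEGIS API):

import math

def posterior_tail_probability(mu0, sigma0, observed, sigma_l, threshold):
    """Conjugate Normal-Normal update, then P(delta >= threshold)."""
    posterior_precision = 1.0 / sigma0**2 + 1.0 / sigma_l**2
    posterior_variance = 1.0 / posterior_precision
    mu_post = posterior_variance * (mu0 / sigma0**2 + observed / sigma_l**2)
    z = (threshold - mu_post) / math.sqrt(posterior_variance)
    # Upper tail: 1 - Phi(z), with Phi expressed via erf
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))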

1.2 Concerns Identified

Concern B1: Model Appropriateness

- The normal distribution assumes symmetric, unbounded risk deltas
- Real-world risk deltas may be:
  - Bounded (you can't lose more than 100%)
  - Skewed (tail risks are typically asymmetric)
  - Fat-tailed (extreme events are more likely than the normal predicts)

Recommendation: Consider log-normal or beta distribution for bounded deltas, or use robust estimators (e.g., Huber loss for outlier resistance).

Concern B2: Default Hyperparameters

- μ₀ = 0 (no change expected): a reasonable neutral prior
- σ₀ = 1.0: implies a 95% prior credible interval of [-1.96, 1.96] for the normalized delta
- σ_L = 0.5: observation noise of ≈ 0.5 implies fairly high trust in observations

Issue: The ratio σ₀²/σ_L² = 1/0.25 = 4 means observations have 4× the precision of the prior. A single observation substantially shifts the posterior. This may be too reactive.

Recommendation: Document the implicit assumption that observations are high-signal. Consider σ_L=1.0 for noisier domains.

Concern B3: Boundary Behavior at P ≈ 0.95

- The gate passes if P(Δ ≥ 2) < 0.95
- At P = 0.9499 the gate passes; at P = 0.9501 it fails
- This discrete threshold creates sensitivity to small changes

Recommendation: Consider a graduated response (e.g., a warning zone at 0.90-0.95) or hysteresis for state transitions; a sketch follows.
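A minimal sketch of that remedy (the state names, band edges, and hysteresis width are assumptions, not the AEGIS implementation):

def gate_state(p, prev_state, warn=0.90, fail=0.95, hysteresis=0.01):
    """Three-zone gate with hysteresis around the fail threshold."""
    if prev_state == "fail" and p >= fail - hysteresis:
        return "fail"  # stay failed until p clearly drops below the band
    if p >= fail:
        return "fail"
    if p >= warn:
        return "warn"
    return "pass"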

2. PERT Three-Point Estimation

2.1 Implementation Review

E[X] = (a + 4m + b) / 6
Var[X] = ((b - a) / 6)²

Mathematical Verification:

- The expected value formula is the modified PERT formula (standard)
- The variance formula is the simplified PERT variance

2.2 Concerns Identified

Concern P1: Variance Approximation

The true PERT beta distribution has:

- Shape parameters: α = 1 + 4(m-a)/(b-a), β = 1 + 4(b-m)/(b-a)
- True variance: Var = αβ / ((α+β)²(α+β+1)) × (b-a)²

The simplified formula ((b-a)/6)² is only accurate when m = (a+b)/2 (symmetric).

Quantitative Impact:

- For symmetric (a=0, m=5, b=10): True Var ≈ 1.39, Simplified Var = 2.78 → 2× overestimate
- For skewed (a=0, m=2, b=10): True Var ≈ 2.86, Simplified Var = 2.78 → close
- For highly skewed (a=0, m=1, b=10): True Var ≈ 3.47, Simplified Var = 2.78 → 20% underestimate

(Note: these figures are corrected in Round 2.)

Recommendation: For governance where tail risks matter, use the true beta variance formula or document that the simplification is conservative for symmetric estimates but may underestimate for highly skewed ones.

Concern P2: Sensitivity to Mode (m)

The expected value E[X] = (a + 4m + b)/6 weights m by 4×.

- Small changes in m have a large impact on the expected value
- There is no guidance on how to elicit m reliably

Recommendation: Add sensitivity analysis or document uncertainty in m as a separate variance component.

3. Utility Function (Rubric v2.1)

3.1 Implementation Review

U = (ΔP_H + γ·ΔV_L) + κ·ΔR - (φ_S·ΔC_S + φ_D·ΔC_D) - ΔOPEX
LCB = U - z_α × √Var(U)
Var(U) = Var(P) + γ²·Var(V) + κ²·Var(R)

Mathematical Verification:

- The LCB formula assumes U is approximately normal (CLT-based justification)
- Variance propagation uses: Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y)

3.2 Concerns Identified

Concern U1: Independence Assumption

The variance calculation:

variance = Var(P) + γ²·Var(V) + κ²·Var(R)

Implicitly assumes Cov(P,V) = Cov(P,R) = Cov(V,R) = 0

This is likely violated:

- Profit and Value are often positively correlated
- Profit and Risk may be negatively correlated (high risk = high reward)
- Value and Risk may have domain-specific relationships

Impact: If components are positively correlated, the true variance is underestimated, leading to:

- The LCB being too optimistic
- More proposals passing that shouldn't

Recommendation:

1. At minimum, add a correlation adjustment factor: Var(U) = Var(P) + γ²Var(V) + κ²Var(R) + ρ_adj × σ_P × σ_V × γ
2. Or document the independence assumption explicitly and monitor it empirically

Concern U2: Kappa (κ) Only for Risk Reduction

kappa_effective = self.kappa if delta_R < 0 else 0.0

This means:

- Risk reduction (ΔR < 0) adds κ×|ΔR| to utility
- Risk increase (ΔR > 0) has zero penalty beyond gate failure

Logical Inconsistency: If risk increases, shouldn't there be a negative utility contribution? The current design relies entirely on the risk gate to block high-risk proposals, but doesn't penalize moderate risk increases in the utility calculation.

Recommendation: Consider: kappa_effective = self.kappa * delta_R (always apply, negative for increase, positive for decrease).

Concern U3: Complexity Cost Exclusion from Variance

Complexity costs (φ_S·ΔC_S + φ_D·ΔC_D) are treated as deterministic constants in variance.

Issue: Complexity estimates also have uncertainty. If complexity is misestimated by 20%, the utility swing could be significant (φ_D = 2000 $/point).

Recommendation: If complexity estimates have standard errors, include them: Var(Complexity) = φ_S²·Var(C_S) + φ_D²·Var(C_D).

4. KL Divergence Drift Detection

4.1 Implementation Review

def _compute_kl_divergence(self, p, q):
    # Assumes `import math` and a module-level epsilon = 1e-10.
    kl = 0.0
    for p_i, q_i in zip(p, q):
        p_i = max(p_i, epsilon)  # epsilon = 1e-10 prevents log(0)
        q_i = max(q_i, epsilon)
        if p_i > 0:  # always true after the clamp above; guard is redundant
            kl += p_i * math.log(p_i / q_i)
    return kl

Mathematical Verification:

- The formula D_KL(P||Q) = Σ P(x) log(P(x)/Q(x)) is correct
- Epsilon smoothing prevents log(0)

4.2 Concerns Identified

Concern K1: Asymmetry of KL Divergence

D_KL(P||Q) ≠ D_KL(Q||P)

In drift detection:

- P = current distribution (what we're monitoring)
- Q = baseline (historical reference)

Issue: KL(P||Q) measures how much P diverges from Q, but:

- If Q has zeros where P doesn't, KL is undefined (epsilon masks this)
- The direction matters: KL is sensitive to mode-seeking vs. mode-covering behavior

Recommendation: Consider Jensen-Shannon divergence: JS = (KL(P||M) + KL(Q||M))/2 where M = (P+Q)/2. This is symmetric and bounded [0, log(2)].
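A minimal sketch of that alternative (assumes p and q are aligned, normalized histograms; illustrative, not the AEGIS implementation):

import math

def js_divergence(p, q, epsilon=1e-10):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    def kl(x, y):
        total = 0.0
        for xi, yi in zip(x, y):
            xi, yi = max(xi, epsilon), max(yi, epsilon)
            total += xi * math.log(xi / yi)
        return total
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)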

Concern K2: Histogram Binning Sensitivity

num_bins: int = 50

Issues:

- 50 bins may be too many for small samples (sparse bins → noisy KL)
- 50 bins may be too few for high-resolution differences
- Bin edges are determined by the data range, not fixed percentiles

Recommendation:

- Use the Freedman-Diaconis rule for adaptive binning: bins = (max - min) / (2 × IQR × n^(-1/3)); see the sketch below
- Or use kernel density estimation for smoother distributions
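A sketch of the first option using NumPy's built-in Freedman-Diaconis support (np.histogram_bin_edges with bins="fd"); variable names are illustrative:

import numpy as np

def aligned_histograms(current, baseline):
    """Freedman-Diaconis edges from the baseline, shared by both samples."""
    edges = np.histogram_bin_edges(baseline, bins="fd")
    # Note: current values outside the baseline range fall out of the histogram.
    p, _ = np.histogram(current, bins=edges)
    q, _ = np.histogram(baseline, bins=edges)
    return p / p.sum(), q / q.sum()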

Concern K3: Non-Stationarity

Issue: KL divergence assumes both P and Q are stationary. If the baseline drifts gradually (regime change), the detection may:

- Fire false positives during the transition
- Miss slow drifts that never exceed the threshold

Recommendation: Add trend-adjusted baselines or exponentially-weighted moving baselines.

5. Six-Gate Integration

5.1 Implementation Review

Gates evaluated in sequence:

1. Risk Gate (Bayesian posterior)
2. Profit Gate (Bayesian posterior)
3. Novelty Gate (logistic)
4. Complexity Floor (hard constraint)
5. Quality Gate (threshold + subscores)
6. Utility Gate (LCB threshold)

Override eligibility: not all_passed and gates[COMPLEXITY].passed

5.2 Concerns Identified

Concern G1: Gate Ordering

Question: Does evaluation order matter?

Analysis: In current implementation, all gates are evaluated regardless of earlier failures. Order doesn't affect pass/fail outcome.

However: If short-circuit evaluation were added (for performance), order would matter. Current design is robust.

Concern G2: Novelty Gate Threshold Analysis

exponent = -k * (N - N0)  # k=10, N0=0.7
gate_value = 1 / (1 + exp(exponent))
passed = gate_value >= 0.8

Solving for the passing threshold:

- G(N) ≥ 0.8 requires 1/(1 + e^(-10(N-0.7))) ≥ 0.8
- ⟹ e^(-10(N-0.7)) ≤ 0.25
- ⟹ -10(N-0.7) ≤ ln(0.25) ≈ -1.386
- ⟹ N ≥ 0.7 + 0.139 = 0.839

The novelty gate passes for N ≥ 0.839 (approximately)

Concern: The logistic is very steep (k=10). Between N=0.8 and N=0.85:

- N=0.80: G=0.731 (FAIL)
- N=0.84: G=0.802 (PASS)
- N=0.85: G=0.818 (PASS)

Very small changes in novelty score cross the threshold.

Recommendation: Consider lower k (e.g., k=5) for smoother transition, or document that novelty scoring must be precise.

Concern G3: Override Logic Soundness

override_eligible = not all_passed and gates[COMPLEXITY].passed

Interpretation: Override is possible if:

- Not all gates passed (obviously, else no override is needed)
- The complexity floor passed (complexity is non-overridable)

This is logically sound. The complexity gate serves as a hard barrier that human override cannot bypass.

Edge Case: What if ONLY complexity fails?

- all_passed = False, gates[COMPLEXITY].passed = False
- override_eligible = False ✓ Correct

What if complexity passes but utility and risk fail?

- all_passed = False, gates[COMPLEXITY].passed = True
- override_eligible = True ✓ Correct

Verified: The override logic is sound.

Concern G4: Confidence Aggregation

min_confidence = min(confidences) if confidences else 1.0

Issue: Minimum confidence across gates loses information. If 5 gates have 0.99 confidence and 1 has 0.5, the min is 0.5.

Alternative: Consider weighted geometric mean or use the most relevant confidence for the failing gate(s).

6. Numerical Stability

6.1 Implementation Review

  • exp() clamped to [-700, 700]: Prevents overflow (e^709 ≈ 10^308)
  • epsilon = 1e-10 for log smoothing: Prevents log(0)
  • epsilon = 0.01 for baseline normalization: Prevents division by zero

6.2 Concerns Identified

Concern N1: Epsilon Inconsistency

Different epsilon values are used:

- 1e-10 for KL divergence (very small)
- 0.01 for baseline normalization (much larger)

Issue: 1e-10 could cause numerical instability with log:

- log(1e-10) ≈ -23
- log(1e-10 / 1e-10) = 0 ✓
- log(0.01 / 1e-10) = log(10⁸) ≈ 18.4

This is within safe ranges, but the inconsistency could be confusing.

Recommendation: Unify epsilon handling with documented rationale for each context.

Concern N2: CDF Tail Accuracy

def _standard_normal_cdf(self, z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Analysis: math.erf is accurate to 15-16 significant digits (double precision).

For extreme tails (|z| > 8), the CDF approaches 0 or 1 with limited precision:

- Φ(8) ≈ 1 - 6.22e-16 (at machine epsilon)
- Φ(-8) ≈ 6.22e-16

For governance thresholds at P=0.95 (z ≈ 1.645), accuracy is excellent. No concern.

7. Integration Point: pcw_decide()

7.1 Concerns Identified

Concern I1: Default Utility Result

if utility_result is None:
    utility_result = UtilityResult(
        raw=0.0,
        variance=0.01,
        lcb=0.1,
        components=default_components,
        decision_path="investment",
    )

Issue: Default LCB=0.1 means missing utility input → gate passes (if threshold=0).

Recommendation: Make utility_result required, or set lcb=-∞ for missing to force failure.
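A sketch of the fail-closed variant, mirroring the block above (the decision_path value and zeroed fields are illustrative):

if utility_result is None:
    # Fail closed: an LCB of -inf guarantees the utility gate cannot
    # pass when the utility input is missing.
    utility_result = UtilityResult(
        raw=0.0,
        variance=0.0,
        lcb=float("-inf"),
        components=default_components,
        decision_path="missing",
    )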

Concern I2: Confidence Aggregation Semantics

Using min(confidences) conflates different types of confidence:

- Bayesian posterior probability (risk/profit gates)
- Logistic gate output (novelty)
- Binary pass/fail (complexity)
- LCB margin (utility)

These are not directly comparable.

Recommendation: Report confidence per gate type rather than aggregating across semantically different measures.


Summary of Findings (Post-Dialogue Consensus)

Critical Issues (Require Immediate Attention)

| Priority | ID | Component | Issue | Impact | Status |
|---|---|---|---|---|---|
| 1 | NEW-A | Bayesian Gates | Posterior vs posterior predictive | Overconfident risk assessment | ✅ IMPLEMENTED (v3.11.0) |
| 2 | U1+ | Utility | Independence + incomplete covariance | LCB miscalculation | ✅ IMPLEMENTED (v3.11.0) |
| 3 | P1* | PERT | Variance error ±22-40% (corrected from Claude's 2×) | Incorrect uncertainty bounds | ✅ DOCUMENTED (v3.11.0) |
| 4 | I1 | Integration | Default utility passes (fail-open) | Security vulnerability | ✅ IMPLEMENTED (v3.11.0) |

Significant Issues (Should Address)

| ID | Component | Issue | Recommendation |
|---|---|---|---|
| NEW-C | Bayesian | Unknown σ_L fragility | Student-t / Normal-Inverse-Gamma |
| NEW-D | Utility | LCB normal approximation weak | Monte Carlo quantile |
| NEW-E | Utility | κ toggle nonlinearity | Smooth penalty or P(ΔR<0) |
| NEW-F | KL | Bin alignment critical | Ensure identical bins |
| NEW-G | KL | Threshold uncalibrated | Bootstrap false alarm rate |
| K1 | KL | Asymmetric measure | Jensen-Shannon or Wasserstein |
| U2* | Utility | κ asymmetry (sign-dependent) | Define sign convention |

Minor Issues (Nice to Have)

| ID | Component | Issue | Status |
|---|---|---|---|
| B2 | Bayesian | Hyperparameter reactivity | Document or tune |
| G2 | Novelty | Steep gate (k=10) | Document precision requirement |
| G4 | Gates | Confidence aggregation | Product or per-gate |
| NEW-H | Gates | Multiple testing | Joint error reasoning |
| NEW-I | Utility | Units/scale coherence | Dimensional analysis |

Verified Sound (Both Models Agree)

  • ✓ Bayesian posterior formula (mathematically correct for latent parameter)
  • ✓ Override logic (logically sound)
  • ✓ Numerical stability (adequate for governance thresholds)
  • ✓ Gate ordering (independent, no short-circuit)
  • ✓ CDF tail accuracy (excellent for P=0.95)
  • ✓ PERT expected value formula (correct)

Round 2: GPT 5.2 Pro Critique

Status: COMPLETE ✅
Duration: 734 seconds (12.2 minutes)
Reasoning Level: xhigh
Timestamp: 2026-01-31T08:48:34

Corrections to Claude's Analysis

GPT 5.2 Pro identified several errors in the initial Claude analysis:

1. PERT Variance Numerical Errors (CRITICAL)

Claude's claim that the simplified PERT variance can be off by "2×" is incorrect.

GPT's Correction: For Beta-PERT with λ=4 (α = 1 + 4(m-a)/(b-a), β = 1 + 4(b-m)/(b-a)):

- α + β = 6 always
- True variance: αβ/252 × (b-a)²
- Simplified variance: (b-a)²/36
- Ratio (true/simplified): αβ/7

Since αβ ∈ [5, 9], the ratio lies in [0.714, 1.286], meaning:

- Worst case: the simplified formula overestimates by 40% (when m=a or m=b)
- Or underestimates by 22% (when m is central)
- NOT 2× as Claude claimed

Corrected Examples (a=0, b=10):

| Mode (m) | α | β | True Var | Simplified Var | Error |
|---|---|---|---|---|---|
| 5 (symmetric) | 3 | 3 | 3.571 | 2.778 | -22% (under) |
| 2 | 1.8 | 4.2 | 3.000 | 2.778 | -8% (under) |
| 1 | 1.4 | 4.6 | 2.556 | 2.778 | +9% (over) |
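The corrected figures are easy to reproduce; a small sketch of both formulas (illustrative, not the AEGIS code):

def pert_variances(a, m, b):
    """True Beta-PERT (lambda=4) variance vs. the simplified ((b-a)/6)^2."""
    alpha = 1 + 4 * (m - a) / (b - a)
    beta = 1 + 4 * (b - m) / (b - a)
    true_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)) * (b - a) ** 2
    simplified_var = ((b - a) / 6) ** 2
    return true_var, simplified_var

# pert_variances(0, 5, 10) -> (3.571..., 2.777...), matching the table above.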

2. Covariance Formula Incomplete

Claude's suggested correlation adjustment was mathematically incomplete. It was missing:

- The factor of 2 in the covariance terms
- The remaining covariance pairs

Correct Formula:

Var(U) = Var(P) + γ²Var(V) + κ²Var(R)
       + 2γ·Cov(P,V) + 2κ·Cov(P,R) + 2γκ·Cov(V,R)
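As a sketch, with the cov_pv, cov_pr, cov_vr parameter names the shipped fix later adopts (see Actionable Outcomes); the signature is otherwise illustrative:

def utility_variance(var_p, var_v, var_r, gamma, kappa,
                     cov_pv=0.0, cov_pr=0.0, cov_vr=0.0):
    """Var(U) with all pairwise covariance terms and the factor of 2."""
    return (var_p + gamma**2 * var_v + kappa**2 * var_r
            + 2 * gamma * cov_pv
            + 2 * kappa * cov_pr
            + 2 * gamma * kappa * cov_vr)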

3. Distribution Recommendations Underspecified

Claude's suggestion to use "log-normal or beta for bounded deltas" is problematic:

- Log-normal has positive support only
- Vanilla beta is supported on [0, 1]
- Deltas can be negative

Better Alternatives:

- Transformed Beta on [L, U]
- Logit-normal / logistic transform
- Truncated / skew-normal
- Student-t likelihood for fat tails (recommended)

GPT's Agreement/Disagreement Matrix

| ID | Claude Concern | GPT Stance | Notes |
|---|---|---|---|
| B1 | Model appropriateness | Partial | Direction right, alternatives underspecified |
| B2 | Hyperparameters reactive | Agree | Posterior weight on observation is 0.8 |
| B3 | Boundary sensitivity | Agree | Hysteresis is standard remedy |
| P1 | Variance approximation | Agree with corrections | Numbers were wrong |
| P2 | Sensitivity to mode | Agree | Consider hierarchical model for m |
| U1 | Independence assumption | Agree | But direction of bias depends on covariance signs |
| U2 | κ asymmetry | Partial | Fix is sign-convention dependent |
| U3 | Complexity in variance | Agree | Also applies to OPEX |
| K1 | KL asymmetry | Agree | Also noted bin alignment issue |
| K2 | Histogram binning | Agree | Keep bins consistent |
| K3 | Non-stationarity | Agree | Add change-point methods |
| G1 | Gate ordering | Agree | Robust if no short-circuit |
| G2 | Novelty steepness | Agree | k=10 implies high sensitivity |
| G3 | Override logic | Agree | Logically sound |
| G4 | Confidence aggregation | Partial | Product is more coherent than min |
| N1 | Epsilon inconsistency | Minor agree | Should be justified |
| N2 | CDF accuracy | Agree | Fine for 0.95 threshold |
| I1 | Default utility | Strongly agree | Fail-open is dangerous |
| I2 | Confidence semantics | Agree | Report per-gate |

Additional Issues Claude Missed (GPT)

GPT identified 9 additional mathematical issues:

A. Posterior vs Posterior Predictive (CRITICAL)

Claude verified the posterior for the latent parameter θ, but decisions should use the posterior predictive for future realized values.

Current (potentially wrong):

P(θ ≥ t | y) = 1 - Φ((t - μ_post) / σ_post)

Correct for realized outcomes:

P(Y_new ≥ t | y) = 1 - Φ((t - μ_post) / √(σ_post² + σ_L²))

The current implementation may be overconfident about realized outcomes.
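A sketch of the corrected computation (names are illustrative):

import math

def predictive_tail_probability(mu_post, sigma_post, sigma_l, threshold):
    """P(Y_new >= threshold): the predictive sd widens by the likelihood noise."""
    z = (threshold - mu_post) / math.sqrt(sigma_post**2 + sigma_l**2)
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))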

B. Sample Size / Aggregation

If "observed" is an average of n observations, likelihood variance should be σ_L²/n. Many systems accidentally treat an average as a single observation.

C. Unknown σ_L (Fat Tails)

Taking σ_L as fixed is the main fragility. A Normal-Inverse-Gamma prior yields Student-t posterior predictive, which is more realistic for risk governance.

D. LCB Normal Approximation Weak

U is a sum of ~3 uncertain components, so the CLT justification is thin. If any component is skewed or heavy-tailed:

- "LCB = mean - zσ" is not the α-quantile
- Better: a Monte Carlo empirical quantile, or a distribution-free bound (Cantelli)
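Both alternatives in a short sketch (assumes `samples` holds Monte Carlo draws of U; illustrative only):

import numpy as np

def lcb_alternatives(samples, alpha=0.05):
    """Empirical alpha-quantile and the distribution-free Cantelli bound."""
    mc_quantile = np.quantile(samples, alpha)
    mu, sigma = samples.mean(), samples.std(ddof=1)
    # Cantelli: P(U <= mu - lam) <= sigma^2 / (sigma^2 + lam^2);
    # solving for level alpha gives lam = sigma * sqrt((1 - alpha) / alpha).
    cantelli = mu - sigma * np.sqrt((1 - alpha) / alpha)
    return mc_quantile, cantelli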

E. Variance Propagation Ignores Nonlinearity

The κ toggle (only when ΔR < 0) makes the mapping nonlinear. If ΔR is uncertain near 0, treating κ as toggled by point estimate sign is mathematically inconsistent.

Recommendation: Incorporate P(ΔR < 0) or use smooth penalty.
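A sketch of the probabilistic weighting (assumes ΔR is summarized by a normal mean and standard deviation; names are illustrative):

import math

def expected_kappa_weight(kappa, delta_r_mean, delta_r_sd):
    """Scale kappa by P(delta_R < 0) rather than a hard sign toggle."""
    # P(dR < 0) = Phi(-mean/sd) for dR ~ N(mean, sd^2)
    p_reduction = 0.5 * (1.0 + math.erf(-delta_r_mean / (delta_r_sd * math.sqrt(2.0))))
    return kappa * p_reduction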

F. KL Bin Alignment Essential

KL requires P and Q over the same discrete support. If histograms use different bin edges, zip(p, q) produces nonsense. This is a common implementation bug.
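A minimal guard against that bug (sketch; assumes the raw 1-D samples are available when the histograms are built):

import numpy as np

# One set of edges over the pooled range, so zip(p, q) in the KL loop
# compares identical bins.
edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=50)
p = np.histogram(current, bins=edges)[0] / len(current)
q = np.histogram(baseline, bins=edges)[0] / len(baseline)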

G. KL Threshold Calibration

Empirical KL between finite histograms is noisy. Without calibration (bootstrap under "no drift"), "KL > τ" has unknown false alarm rate.
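A sketch of that calibration (illustrative): resample the baseline against itself under the no-drift null and take a high quantile of the resulting KL values as the alarm threshold.

import numpy as np

def calibrate_kl_threshold(baseline, num_bins=50, n_boot=1000, fpr=0.01, seed=0):
    """(1 - fpr) quantile of KL under bootstrap resampling of the baseline."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(baseline, bins=num_bins)
    q = np.histogram(baseline, bins=edges)[0] + 1e-10
    q = q / q.sum()
    kls = []
    for _ in range(n_boot):
        boot = rng.choice(baseline, size=len(baseline), replace=True)
        p = np.histogram(boot, bins=edges)[0] + 1e-10
        p = p / p.sum()
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.quantile(kls, 1 - fpr))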

H. Multiple Testing / Selection Bias

With many gates, the overall false rejection/acceptance behavior is not obvious. If gate confidences are probabilistic, the system should reason about joint error rates.

I. Units/Scale Coherence

Utility combines profit/value/risk with weights (γ, κ, φ_S, φ_D). This is mathematically valid only if the components are on commensurate scales (all dollars, or all normalized). Dimensional analysis is needed.

GPT's Overall Assessment

Local coherence: Many individual formulas (Normal-Normal posterior update, logistic mapping, LCB form) are standard in isolation.

Global coherence is weaker because the system mixes:

- Bayesian tail probabilities (clear probabilistic semantics)
- Heuristic scoring functions (logistic novelty "confidence")
- Deterministic hard constraints
- A normal-approximate LCB using partial variance propagation

without a single consistent probabilistic or decision-theoretic interpretation tying them together.

Most Serious Mathematical Risks (Priority Order):

1. Overconfidence from posterior vs posterior predictive
2. Uncalibrated uncertainty propagation in utility
3. PERT variance confusion (unclear distribution definition)
4. KL drift implementation hazards (bin alignment, calibration)

Bottom Line: AEGIS is a reasonable heuristic multi-gate framework, but not yet mathematically "tight" as a probabilistic governance system.


Round 3: Claude Revision

Status: COMPLETE ✅ Timestamp: 2026-01-31

Acknowledged Corrections

I acknowledge GPT 5.2 Pro's corrections to my initial analysis:

  1. PERT Variance: My numerical examples were incorrect. The true error range is ±22-40%, not 2×. I accept this correction.

  2. Covariance Formula: I omitted the factor of 2 and additional covariance pairs. The complete formula GPT provided is correct.

  3. Distribution Recommendations: My suggestion of "log-normal or beta" was underspecified for signed deltas. Student-t likelihood is a better recommendation for robustness.

Revised Summary of Findings

Based on the Claude-GPT dialogue, here is the consolidated assessment:

Critical Issues (Require Immediate Attention)

| Priority | ID | Issue | Impact | Recommendation |
|---|---|---|---|---|
| 1 | NEW-A | Posterior vs posterior predictive | Overconfident risk assessment | Use √(σ_post² + σ_L²) for realized outcomes |
| 2 | U1+ | Variance independence + incomplete covariance | LCB miscalculation | Implement full covariance matrix with factor 2 |
| 3 | P1* | PERT variance error ±22-40% | Incorrect uncertainty bounds | Document or implement true beta variance |
| 4 | I1 | Default utility passes | Fail-open vulnerability | Fail closed on missing input |

Significant Issues (Should Address)

| ID | Issue | Recommendation |
|---|---|---|
| NEW-C | Unknown σ_L fragility | Consider Normal-Inverse-Gamma / Student-t |
| NEW-D | LCB normal approximation weak | Use Monte Carlo for empirical quantile |
| NEW-E | κ toggle nonlinearity | Incorporate P(ΔR < 0) or smooth penalty |
| NEW-F | KL bin alignment | Ensure identical bins for P and Q |
| NEW-G | KL threshold uncalibrated | Bootstrap for false alarm rate |
| K1 | KL asymmetry | Consider Jensen-Shannon or Wasserstein |
| U2* | κ asymmetry (sign-dependent) | Define ΔR sign convention explicitly |

Minor Issues (Nice to Have)

| ID | Issue | Status |
|---|---|---|
| B2 | Hyperparameter reactivity | Document or tune empirically |
| G2 | Novelty gate steepness | Document precision requirement |
| G4 | Confidence aggregation | Use product or per-gate reporting |
| NEW-H | Multiple testing | Reason about joint error rates |
| NEW-I | Units/scale coherence | Dimensional analysis documentation |

Revised Verification Status

| Component | Status | Notes |
|---|---|---|
| Bayesian posterior formula | ✅ Correct | But use predictive for realized outcomes |
| Override logic | ✅ Sound | Boolean formula is consistent |
| Numerical stability | ✅ Adequate | For standard inputs |
| Gate ordering | ✅ Independent | No short-circuit evaluation |
| PERT expected value | ✅ Correct | (a + 4m + b) / 6 |
| PERT variance | ⚠️ Simplified | ±22-40% error depending on mode |
| Variance propagation | ❌ Incomplete | Missing factor 2, covariance terms |
| LCB interpretation | ⚠️ Approximate | CLT weak with few terms |
| KL implementation | ⚠️ Needs review | Bin alignment critical |

Consensus Points (Claude + GPT Agreement)

Both models agree on:

  1. Independence assumption is problematic - variance calculation ignores covariance
  2. Default utility should fail closed - fail-open on missing is dangerous
  3. KL divergence should be symmetric - Jensen-Shannon is better
  4. Confidence aggregation is semantically flawed - mixing incomparable measures
  5. Novelty gate is steep - small changes cross threshold
  6. Override logic is sound - complexity non-override is consistent
  7. Numerical stability is adequate - for governance thresholds

Disagreement Points (Resolved)

| Topic | Claude Initial | GPT Correction | Resolution |
|---|---|---|---|
| PERT variance error | Up to 2× | ±22-40% | GPT correct |
| Covariance formula | Incomplete | Full formula with factor 2 | GPT correct |
| Distribution alternatives | Log-normal/beta | Student-t/transformed | GPT more practical |

Appendix A: Mathematical Derivations

A.1 Bayesian Posterior for Normal-Normal

Given:

- Prior: θ ~ N(μ₀, σ₀²)
- Likelihood: x | θ ~ N(θ, σ_L²)

Posterior: θ | x ~ N(μ_post, σ_post²)

Where:

- σ_post² = 1/(1/σ₀² + 1/σ_L²)
- μ_post = σ_post² × (μ₀/σ₀² + x/σ_L²)

A.2 PERT Beta Distribution

The PERT beta distribution on [a, b] with mode m has:

- α = 1 + 4(m-a)/(b-a)
- β = 1 + 4(b-m)/(b-a)

Mean: μ = (a + 4m + b)/6
Variance: σ² = (b-a)² × αβ / ((α+β)²(α+β+1))

A.3 LCB Derivation

For U = Σ w_i X_i where the X_i are independent:

- E[U] = Σ w_i E[X_i]
- Var(U) = Σ w_i² Var(X_i) (if independent)

LCB at confidence 1-α: LCB = E[U] - z_α × √Var(U)

Where z_α = Φ⁻¹(1-α). For α=0.05, z_α ≈ 1.645.


Appendix B: Test Evidence

| Metric | Value |
|---|---|
| Total Tests | 846 |
| Coverage | 93.60% |
| Python Versions | 3.9, 3.10, 3.11, 3.12 |
| Security Scans | 0 issues (bandit + safety) |
| Type Checking | mypy --strict: 0 issues |
| Linting | ruff: all passed |

Change Log

| Version | Date | Changes |
|---|---|---|
| 1.2.0 | 2026-01-31 | All 4 critical issues implemented; Status column added to issue tables |
| 1.1.0 | 2026-01-31 | Complete 3-round Claude-GPT dialogue; GPT corrections integrated; 9 new issues identified |
| 1.0.0 | 2026-01-31 | Initial Claude analysis |

Appendix C: Dialogue Metadata

| Metric | Value |
|---|---|
| Mode | ideation-critique |
| Rounds | 3 |
| GPT Reasoning | xhigh |
| GPT Latency | 734 seconds (12.2 min) |
| Total Issues Identified | 22 (4 critical, 7 significant, 5 minor, 6 verified sound) |
| Claude Errors Corrected | 3 (PERT variance, covariance formula, distribution recommendations) |
| New Issues from GPT | 9 (labeled NEW-A through NEW-I) |

Model Contributions

| Model | Role | Key Contributions |
|---|---|---|
| Claude Opus 4.5 | Initiator | Initial mathematical review, identified 11 issues |
| GPT 5.2 Pro | Critic | Corrected 3 errors, identified 9 additional issues, provided overall coherence assessment |
| Claude Opus 4.5 | Reviser | Acknowledged corrections, consolidated findings, produced final consensus |

Actionable Outcomes

  1. Priority 1: Fix posterior vs posterior predictive in Bayesian gates (NEW-A) - IMPLEMENTED v3.11.0
     - Added compute_posterior_predictive() method
     - Added use_predictive parameter to existing methods
     - See: ADR-006-posterior-predictive.md
  2. Priority 2: Implement full covariance matrix in utility variance (U1+) - IMPLEMENTED v3.11.0
     - Added optional cov_pv, cov_pr, cov_vr parameters
     - Factor of 2 included per the variance formula
  3. Priority 3: Document PERT variance limitations or implement true beta (P1*) - DOCUMENTED v3.11.0
     - Enhanced docstring on ThreePointEstimate.variance
     - Provides the true Beta-PERT variance formula for reference
  4. Priority 4: Change default utility to fail-closed (I1) - IMPLEMENTED v3.11.0
     - Default lcb=float('-inf') ensures missing utility fails the gate
  5. Consider: ADR for statistical model improvements