
Production Deployment Guide


Version: 1.0.0 | Updated: 2026-02-09 | Status: Active

This guide covers deploying AEGIS in production environments including Docker, Kubernetes, and AWS.


1. Prerequisites

  • Python: 3.9+ (3.11 recommended for production)
  • pip: 21.0+
  • OS: Linux (Ubuntu 22.04+, Amazon Linux 2023) or macOS
  • Optional: Docker 24+, Kubernetes 1.28+, PostgreSQL 15+

Deployment Profiles

Profile     | Install Command                            | Use Case
------------|--------------------------------------------|--------------------------------------------
Minimal     | pip install -e "."                         | Evaluation only, zero dependencies
Standard    | pip install -e ".[engine,telemetry]"       | Production with metrics + scipy z-scores
Full        | pip install -e ".[all]"                    | All features including crypto + persistence
PQ-Hardened | pip install -e ".[crypto,pqc,persistence]" | Post-quantum signatures + durable state

2. Installation Profiles

Minimal (Zero Dependencies)

pip install -e "."

Provides: pcw_decide(), CLI (aegis evaluate), gate evaluation, Bayesian posterior. No scipy (z-scores unavailable), no Prometheus metrics, no YAML config loading.

Standard

pip install -e ".[engine,telemetry,config]"

Adds:

  • engine: scipy for utility z-score computation
  • telemetry: prometheus_client for Prometheus metrics exporter
  • config: pyyaml for YAML configuration loading

Full

pip install -e ".[all]"

Adds all optional groups: engine, telemetry, config, mcp, crypto, pqc, persistence.

Optional Dependency Groups

Group       | Package(s)                     | Purpose
------------|--------------------------------|-----------------------------------------------
engine      | scipy                          | Utility z-score computation
telemetry   | prometheus_client              | Prometheus metrics exporter
config      | pyyaml                         | YAML configuration loading
mcp         | pyyaml                         | MCP server for AI agents
crypto      | btclib, coincurve              | BIP-340 Schnorr signatures
pqc         | liboqs-python                  | ML-DSA-44, ML-KEM-768 (requires native liboqs)
persistence | sqlalchemy, asyncpg, aiosqlite | Durable workflow state

3. Configuration

Default Configuration

from aegis_governance import AegisConfig

# Uses frozen defaults matching schema/interface-contract.yaml
config = AegisConfig.default()

YAML Configuration

# Requires: pip install -e ".[config]"
config = AegisConfig.from_yaml("config.yaml")

Example config.yaml:

parameters:
  epsilon_R: 0.01
  epsilon_P: 0.01
  risk_trigger_factor: 2.0
  profit_trigger_factor: 2.0
  trigger_confidence_prob: 0.95
  novelty_gate:
    N0: 0.7
    k: 10.0
    output_threshold: 0.6
  complexity_floor: 0.5
  quality_min_score: 0.7

Dict Configuration

config = AegisConfig.from_dict({"epsilon_R": 0.02, "quality_min_score": 0.8})

Environment Variables

Variable           | Purpose                      | Default
-------------------|------------------------------|---------------------
AEGIS_CONFIG_PATH  | Path to YAML config file     | None (uses defaults)
AEGIS_METRICS_PORT | Metrics server port          | 9090
AEGIS_LOG_LEVEL    | Logging level                | INFO
DATABASE_URL       | PostgreSQL connection string | None (in-memory)
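
The precedence above (explicit variable, else documented default) can be read with standard-library fallbacks. A minimal sketch; the helper name load_runtime_env is hypothetical, not part of the AEGIS API:

```python
import os

def load_runtime_env() -> dict:
    # Hypothetical helper illustrating the documented defaults;
    # unset variables fall back to the values in the table above.
    return {
        "config_path": os.environ.get("AEGIS_CONFIG_PATH"),    # None -> frozen defaults
        "metrics_port": int(os.environ.get("AEGIS_METRICS_PORT", "9090")),
        "log_level": os.environ.get("AEGIS_LOG_LEVEL", "INFO"),
        "database_url": os.environ.get("DATABASE_URL"),        # None -> in-memory
    }
```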

Frozen Parameter Policy

schema/interface-contract.yaml is the authoritative source for parameter values. AegisConfig defaults match this file exactly. Runtime mutation is impossible (frozen dataclass). Parameter changes require formal recalibration approval.
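
The "impossible to mutate" guarantee comes from Python's frozen-dataclass machinery. A stand-in sketch (FrozenParams is illustrative; the real class is AegisConfig in aegis_governance):

```python
from dataclasses import FrozenInstanceError, dataclass

# Illustrative stand-in for the frozen-dataclass mechanism; the real
# AegisConfig lives in aegis_governance and matches interface-contract.yaml.
@dataclass(frozen=True)
class FrozenParams:
    epsilon_R: float = 0.01
    quality_min_score: float = 0.7

params = FrozenParams()
try:
    params.epsilon_R = 0.05  # any runtime mutation attempt...
except FrozenInstanceError:
    print("mutation rejected")  # ...raises FrozenInstanceError
```

Because assignment raises rather than silently succeeding, drifted parameters cannot enter a running process; changes must go through the recalibration process and a new release.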


4. Docker Deployment

Dockerfile

The repository includes a multi-stage Dockerfile:

docker build -t aegis-governance .

Key features:

  • Multi-stage build (smaller image)
  • Non-root user (aegis, UID 1000)
  • Health check via aegis version
  • Exposes port 9090 for metrics

Docker Compose

Start AEGIS with Prometheus and Grafana:

docker compose up -d

Services:

Service    | Port | Purpose
-----------|------|----------------------
aegis      | 9090 | AEGIS metrics server
prometheus | 9091 | Prometheus monitoring
grafana    | 3000 | Grafana dashboards

Environment Variables (Docker)

Variable           | Default                             | Purpose
-------------------|-------------------------------------|------------------------------
AEGIS_CONFIG_PATH  | /app/schema/interface-contract.yaml | Config file path
AEGIS_METRICS_PORT | 9090                                | Metrics endpoint port
DATABASE_URL       | None                                | PostgreSQL connection string

Volume Mounts

Container Path | Purpose
---------------|----------------------------------
/app/schema/   | Configuration schemas (read-only)
/app/config/   | Custom configuration (optional)

5. Kubernetes Deployment

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aegis-governance
  template:
    metadata:
      labels:
        app: aegis-governance
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: aegis
          image: aegis-governance:latest
          # TIP: For production, consider moving this to an entrypoint script
          # (e.g., docker-entrypoint.sh) for easier maintenance and log handling.
          command: ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"]
          ports:
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            exec:
              command: ["aegis", "version"]
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: AEGIS_CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: aegis-secrets
                  key: database-url
                  optional: true
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: aegis-config

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-config
data:
  config.yaml: |
    parameters:
      epsilon_R: 0.01
      epsilon_P: 0.01
      risk_trigger_factor: 2.0
      profit_trigger_factor: 2.0
      trigger_confidence_prob: 0.95

Secret

apiVersion: v1
kind: Secret
metadata:
  name: aegis-secrets
type: Opaque
stringData:
  database-url: "postgresql+asyncpg://{USER}:{PASSWORD}@{HOST}:5432/aegis"
  # HSM credentials (if PQ-hardened profile)
  hsm-pin: "{HSM_PIN}"

Service + ServiceMonitor

apiVersion: v1
kind: Service
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  selector:
    app: aegis-governance
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aegis-governance
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aegis-governance
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

6. AWS Deployment

Lambda (Single Evaluation)

Use Lambda for on-demand proposal evaluation:

import json
from aegis_governance import AegisConfig, PCWContext, PCWPhase, pcw_decide

config = AegisConfig.default()
evaluator = config.create_gate_evaluator()

def handler(event, context):
    ctx = PCWContext(
        agent_id=event.get("agent_id", "lambda"),
        session_id=context.aws_request_id,
        phase=PCWPhase.PLAN,
        proposal_summary=event["proposal_summary"],
        estimated_impact=event.get("estimated_impact", "medium"),
        risk_proposed=event.get("risk_score", 0.1),
        complexity_score=event.get("complexity_score", 0.5),
        quality_score=event.get("quality_score", 0.8),
    )
    decision = pcw_decide(ctx, gate_evaluator=evaluator)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": decision.status.value,
            "rationale": decision.rationale,
            "decision_id": decision.decision_id,
        }),
    }

Package with: pip install -e ".[engine]" -t ./package

ECS/Fargate (Metrics Server)

For long-running metrics collection:

{
  "family": "aegis-governance",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "aegis",
      "image": "aegis-governance:latest",
      "command": ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"],
      "portMappings": [
        { "containerPort": 9090, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AEGIS_METRICS_PORT", "value": "9090" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "aegis version || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}

CDK Deployment

AEGIS infrastructure is defined as CDK stacks in infra/. The cdk.json context uses empty placeholder values for account-specific settings; the actual AWS account and region are injected via CI environment variables during deployment. This is standard CDK practice — it avoids committing account identifiers to source control.

See .github/workflows/aegis-deploy.yml for the deployment pipeline, which uses OIDC federation (no static credentials) and supports tag-triggered or manual deployments.

DynamoDB (Audit Trail)

For serverless audit trail storage, implement a DynamoDB-backed repository following the WorkflowPersistence protocol in src/workflows/persistence/repository.py. The ORM models in src/workflows/persistence/models.py define the schema (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint).

Customer Management

The Lambda handler includes customer tracking (Phase 1 visibility). API key identity is extracted from API Gateway, mapped to customer records in DynamoDB, and usage is metered per evaluation. See the Customer Management Guide for setup, CLI commands, and troubleshooting.


7. Observability Setup

Metrics Endpoint

Start the built-in metrics server:

# Programmatic (persistent server in background thread)
from telemetry.metrics_server import MetricsServer
server = MetricsServer(port=9090)
server.start()

To dump current metrics once (useful for debugging):

aegis metrics
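
For a sense of what the endpoint serves, here is a toy /metrics server in the Prometheus text exposition format, built only from the standard library. This is an illustrative stand-in, not the real telemetry.metrics_server.MetricsServer; the sample metric body is assumed:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed sample body in Prometheus text exposition format.
SAMPLE = (
    "# HELP aegis_decisions_total Decision outcomes by gate type\n"
    "# TYPE aegis_decisions_total counter\n"
    'aegis_decisions_total{status="approved"} 42\n'
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = SAMPLE.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 asks the OS for a free port; serve in a daemon thread.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"serving on port {server.server_address[1]}")
```

Prometheus scrapes exactly this kind of plaintext payload from the real server's port 9090.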

Available Metrics

Metric                         | Type      | Description
-------------------------------|-----------|-------------------------------
aegis_decisions_total          | Counter   | Decision outcomes by gate type
aegis_gates_evaluated_total    | Counter   | Gate evaluations performed
aegis_latency_seconds          | Histogram | Operation latency
aegis_decision_latency_seconds | Histogram | End-to-end decision latency
aegis_proposals_total          | Counter   | Proposals by state
aegis_active_proposals         | Gauge     | Currently active proposals
aegis_kl_divergence            | Gauge     | Current KL divergence
aegis_override_requests_total  | Counter   | Override requests by outcome
aegis_errors_total             | Counter   | Errors by component

Prometheus Setup

  1. Copy recording rules: monitoring/prometheus/recording-rules.yaml
  2. Copy alerting rules: monitoring/prometheus/alerting-rules.yaml
  3. Configure scrape target: http://aegis:9090/metrics

Pre-computed recording rules:

  • aegis:gate_pass_rate_5m — Gate pass rate over 5 minutes
  • aegis:decision_rate_5m — Decision rate by status
  • aegis:p99_latency_5m — p99 latency by operation
  • aegis:override_rate_1h — Override request rate
  • aegis:error_rate_5m — Error rate by component
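
The authoritative expressions live in monitoring/prometheus/recording-rules.yaml; for orientation, a rule of this general shape (the expression below is an assumed illustration built from the metric names listed earlier, not a copy of the shipped file):

```yaml
groups:
  - name: aegis-recording-rules
    rules:
      - record: aegis:error_rate_5m
        # Assumed shape: per-component error rate over a 5-minute window
        expr: sum by (component) (rate(aegis_errors_total[5m]))
```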

Alerting Rules

Alert                     | Condition                              | Severity
--------------------------|----------------------------------------|---------
AegisHighGateFailRate     | Gate pass rate < 50% for 10m           | Warning
AegisHighLatency          | p99 latency > 1s for 5m                | Warning
AegisOverrideSpike        | Override rate > 0.1/min for 15m        | Critical
AegisDriftCritical        | KL divergence status = critical for 5m | Critical
AegisOverrideStalePartial | Partial override stuck > 2h            | Warning
AegisErrorRate            | Error rate > 5% for 5m                 | Warning

Grafana Dashboards

Import from monitoring/grafana/:

  • overview-dashboard.json — AEGIS system overview (decisions, gates, latency)
  • risk-analysis-dashboard.json — Risk analytics (drift, Bayesian, overrides)

HTTP Telemetry Sink

Stream telemetry events to a remote collector via HTTP POST.

CLI:

echo '{"risk_proposed": 0.3, "profit_proposed": 0.1}' | \
  aegis evaluate --telemetry-url https://aegis-api-980022636831.us-central1.run.app/v1/events

YAML Configuration (aegis.yaml):

telemetry_url: "https://aegis-api-980022636831.us-central1.run.app/v1/events"

Programmatic (per-event):

from aegis_governance import HTTPEventSink, TelemetryEmitter, pcw_decide

emitter = TelemetryEmitter(source="my-service")
emitter.add_sink(HTTPEventSink(
    url="https://aegis-api-980022636831.us-central1.run.app/v1/events",
    headers={"Authorization": "Bearer ${TOKEN}"},
    timeout=10,
))
result = pcw_decide(context, telemetry_emitter=emitter)

Programmatic (batched — recommended for production):

from aegis_governance import http_sink

sink = http_sink(
    "https://aegis-api-980022636831.us-central1.run.app/v1/events",
    batch_size=100,
    flush_interval_seconds=60,
)
sink.start()  # Background flush thread
emitter.add_sink(sink)
# ... on shutdown:
sink.stop()   # Drains remaining events

Production notes:

  • Use HTTPS with authentication headers (Authorization: Bearer ... or X-API-Key: ...)
  • BatchHTTPSink buffers events and flushes as JSON arrays (bounded: maxlen=batch_size*10)
  • Failures are logged but never propagated — telemetry must not crash the producer
  • Background flush thread is a daemon and sleeps in 1-second increments for responsive shutdown
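
The bounded-buffer behaviour is worth internalising: assuming the buffer is a deque with maxlen = batch_size * 10 (as the note above states), sustained backpressure evicts the oldest events rather than growing memory without bound. A sketch of that eviction semantics:

```python
from collections import deque

batch_size = 3
buffer = deque(maxlen=batch_size * 10)  # bounded: capacity 30

for i in range(35):           # produce more events than the buffer holds
    buffer.append({"event_id": i})

print(len(buffer))            # 30 -- capped at capacity, not 35
print(buffer[0]["event_id"])  # 5  -- events 0-4 were evicted, oldest first
```

Under a prolonged collector outage this trades completeness for stability: the producer keeps running and the newest events survive.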

TLS Requirement (CoSAI MCP-T7)

HTTPEventSink and BatchHTTPSink require https:// URLs by default. Governance telemetry contains decision rationale, gate results, and drift metrics — transmitting this over plaintext risks interception and tampering.

Component               | Enforcement                   | Escape Hatch
------------------------|-------------------------------|----------------------
HTTPEventSink           | Rejects http:// URLs          | allow_insecure=True
BatchHTTPSink           | Rejects http:// URLs          | allow_insecure=True
http_sink() factory     | Rejects http:// URLs          | allow_insecure=True
MCP telemetry_url param | Rejects http:// scheme        | None (network-facing)
CLI --telemetry-url     | Warns and disables on http:// | None

Local development (when you need http://):

from telemetry.emitter import HTTPEventSink

# Explicitly opt into insecure transport for local development
sink = HTTPEventSink(
    url="http://localhost:9090/events",
    allow_insecure=True,  # Logs a WARNING for audit trail
)

MCP server: The MCP server (aegis-mcp-server) does not accept http:// telemetry URLs regardless of binding. This is defense-in-depth — a network-accessible endpoint should never relay telemetry over plaintext.

Reference: CoSAI MCP Security Taxonomy, MCP-T7 (Session and Transport Security Failures) recommends TLS 1.2+ for all MCP transport channels.


8. HSM / KMS Integration

Status: Implemented — AWSKMSKEKProvider and HSMKEKProvider available in src/crypto/

8.1 Architecture Overview

AEGIS uses envelope encryption to protect HybridKEM private keys at rest. The HSM or KMS wraps/unwraps the private key blob; all HybridKEM cryptographic operations (X25519 + ML-KEM-768 + AES-256-GCM) execute in software because no HSM natively supports ML-KEM-768.

Startup:
  HSM/KMS unwraps --> HybridKEM private key blob --> held in memory

Runtime:
  encrypt: HybridKEMProvider.encrypt(plaintext, public_key) --> ciphertext
  decrypt: HybridKEMProvider.decrypt(ciphertext, private_key) --> plaintext

The HSM/KMS protects the key at rest (storage, rotation, access control). The KEKProvider abstraction in src/crypto/kek_provider.py selects the appropriate backend via the get_kek_provider() factory.

8.2 Provider Selection

Provider    | Class                  | Install                 | Use Case
------------|------------------------|-------------------------|------------------------------
AWS KMS     | AWSKMSKEKProvider      | pip install -e ".[kms]" | AWS deployments (recommended)
PKCS#11 HSM | HSMKEKProvider         | pip install -e ".[hsm]" | On-premises / CloudHSM
Environment | EnvironmentKEKProvider | (built-in)              | Containerized deployments
In-Memory   | InMemoryKEKProvider    | (built-in)              | Testing only

The get_kek_provider("auto") factory tries providers in order: KMS, HSM, Environment, In-Memory.
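
The fallback chain can be sketched as follows. The provider stubs and the select_provider helper below are hypothetical illustrations of the pattern; the real factory is get_kek_provider() in src/crypto/kek_provider.py:

```python
import os

class UnavailableError(Exception):
    """Raised by a stub when its backend is not configured."""

# Hypothetical stand-ins: each candidate checks its configuration and
# either returns a provider or signals that it is unavailable.
def _kms():
    if "AEGIS_KMS_KEY_ID" not in os.environ:
        raise UnavailableError("no KMS key configured")
    return "kms-provider"

def _hsm():
    if "AEGIS_HSM_PKCS11_LIB" not in os.environ:
        raise UnavailableError("no PKCS#11 library configured")
    return "hsm-provider"

def _env():
    if "AEGIS_MASTER_KEK_PRIVATE" not in os.environ:
        raise UnavailableError("no environment key")
    return "environment-provider"

def _memory():
    return "in-memory-provider"  # always available (testing only)

def select_provider():
    # Try each backend in the documented order: KMS, HSM, Environment, In-Memory.
    for candidate in (_kms, _hsm, _env, _memory):
        try:
            return candidate()
        except UnavailableError:
            continue
```

The upshot: a fully configured AWS deployment silently picks KMS, while a bare test environment degrades to the in-memory provider.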

8.3 AWS KMS Provider

Implementation: src/crypto/kms_kek_provider.py

Setup

  1. Create a KMS symmetric key (AES-256):

    aws kms create-key --key-spec SYMMETRIC_DEFAULT \
      --key-usage ENCRYPT_DECRYPT \
      --description "AEGIS KEK wrapping key"
  2. Generate the HybridKEM keypair and wrap the private key:

    python scripts/generate_master_kek.py
    # Outputs: public key (base64), private key (base64)
    
    # Wrap the private key with KMS:
    aws kms encrypt \
      --key-id <KMS_KEY_ARN> \
      --plaintext fileb://private_key.bin \
      --encryption-context purpose=aegis-kek,version=1 \
      --output text --query CiphertextBlob > wrapped_private_key.b64
  3. Set environment variables:

    export AEGIS_KMS_KEY_ID="arn:aws:kms:us-west-2:123456789012:key/..."
    export AEGIS_KMS_WRAPPED_PRIVATE_KEY="$(cat wrapped_private_key.b64)"
    export AEGIS_MASTER_KEK_PUBLIC="$(cat public_key.b64)"

IAM Policy

The execution role needs kms:Decrypt permission with the encryption context condition:

{
  "Effect": "Allow",
  "Action": "kms:Decrypt",
  "Resource": "<KMS_KEY_ARN>",
  "Condition": {
    "StringEquals": {
      "kms:EncryptionContext:purpose": "aegis-kek"
    }
  }
}

Key Rotation

AWS KMS supports automatic annual key rotation. The wrapped blob remains valid because KMS tracks key versions internally. To rotate the HybridKEM keypair itself:

  1. Generate new keypair
  2. Wrap new private key with KMS
  3. Re-encrypt all stored governance keys with the new KEK
  4. Update environment variables and bump version

8.4 HSM Provider (PKCS#11)

Implementation: src/crypto/hsm_kek_provider.py

Supported HSMs

HSM          | Interface | Notes
-------------|-----------|--------------------------
AWS CloudHSM | PKCS#11   | Production recommended
YubiHSM 2    | PKCS#11   | On-premises
SoftHSM2     | PKCS#11   | Development/testing only

Setup

  1. Install the PKCS#11 library for your HSM (e.g., /opt/cloudhsm/lib/libcloudhsm_pkcs11.so)

  2. Create an AES-256 wrapping key in the HSM:

    # Example for SoftHSM2:
    softhsm2-util --init-token --slot 0 --label aegis --pin 1234 --so-pin 0000
    pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
      --login --pin 1234 --token-label aegis \
      --keygen --key-type AES:32 --label aegis-wrapping-key
  3. Generate the HybridKEM keypair and wrap with the HSM key:

    python scripts/generate_master_kek.py
    # Then wrap using your HSM's key wrapping utility
    # (CKM_AES_KEY_WRAP_KWP / RFC 5649)
  4. Set environment variables:

    export AEGIS_HSM_PKCS11_LIB="/opt/cloudhsm/lib/libcloudhsm_pkcs11.so"
    export AEGIS_HSM_TOKEN_LABEL="aegis"
    export AEGIS_HSM_PIN="1234"
    export AEGIS_HSM_WRAPPING_KEY_LABEL="aegis-wrapping-key"
    export AEGIS_HSM_WRAPPED_PRIVATE_KEY="$(base64 wrapped_private_key.bin)"
    export AEGIS_MASTER_KEK_PUBLIC="$(cat public_key.b64)"

Session Pooling

HSMKEKProvider maintains a bounded session pool (collections.deque(maxlen=pool_size)) to amortise PKCS#11 session setup costs. Default pool size is 4. Sessions are borrowed for unwrap operations at startup and returned to the pool. Overflow sessions are closed rather than queued.
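
The pool semantics can be sketched with a toy class (SessionPool and FakeSession are illustrative; the real pool is internal to HSMKEKProvider):

```python
import threading
from collections import deque

class FakeSession:
    """Stand-in for a PKCS#11 session."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class SessionPool:
    """Toy bounded pool mirroring the behaviour described above: borrows
    come from the pool when possible; returns beyond pool_size are closed
    rather than queued."""
    def __init__(self, open_session, pool_size=4):
        self._open = open_session
        self._pool = deque(maxlen=pool_size)
        self._lock = threading.Lock()

    def borrow(self):
        with self._lock:
            # Reuse a pooled session if one exists, else open a fresh one.
            return self._pool.popleft() if self._pool else self._open()

    def give_back(self, session):
        with self._lock:
            if len(self._pool) < self._pool.maxlen:
                self._pool.append(session)
            else:
                session.close()  # overflow: close, don't queue
```

Bounding the pool keeps the number of long-lived PKCS#11 sessions predictable while still amortising session setup for steady-state traffic.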

Key Wrapping Mechanism

The provider uses CKM_AES_KEY_WRAP_KWP (RFC 5649) which supports arbitrary-length payloads. The HybridKEM private key is 2,432 bytes (32 bytes X25519 + 2,400 bytes ML-KEM-768).

8.5 Thread Safety

Both providers use threading.Lock to protect encrypt/decrypt operations. The KMS provider synchronises HybridKEM operations. The HSM provider additionally protects the session pool with a separate lock. boto3 clients are thread-safe by design.

8.6 Configuration Reference

Variable                      | Provider | Description
------------------------------|----------|---------------------------------------------------
AEGIS_KMS_KEY_ID              | KMS      | KMS key ARN or alias
AEGIS_KMS_WRAPPED_PRIVATE_KEY | KMS      | Base64-encoded KMS-encrypted private key
AEGIS_MASTER_KEK_PUBLIC       | Both     | Base64-encoded HybridKEM public key (1,216 bytes)
AEGIS_HSM_PKCS11_LIB          | HSM      | Path to PKCS#11 shared library
AEGIS_HSM_TOKEN_LABEL         | HSM      | HSM token/slot label
AEGIS_HSM_PIN                 | HSM      | HSM user PIN
AEGIS_HSM_WRAPPING_KEY_LABEL  | HSM      | Label of AES wrapping key in HSM
AEGIS_HSM_WRAPPED_PRIVATE_KEY | HSM      | Base64-encoded HSM-wrapped private key

8.7 Migration from EnvironmentKEKProvider

To migrate from plaintext environment keys to KMS/HSM:

  1. Export current keys: read AEGIS_MASTER_KEK_PRIVATE and AEGIS_MASTER_KEK_PUBLIC
  2. Wrap the private key using KMS encrypt or HSM key wrapping
  3. Set the new environment variables (see sections 8.3 or 8.4)
  4. Remove AEGIS_MASTER_KEK_PRIVATE from the environment
  5. Change provider_type to "kms" or "hsm" (or use "auto" for detection)
  6. Verify with aegis health or by running a test encrypt/decrypt cycle

9. Multi-Region DR

AEGIS uses an active-passive strategy with audit trail replication.

Strategy

Component     | Primary     | DR
--------------|-------------|-------------------
AEGIS Process | Active      | Standby
PostgreSQL    | Primary     | Streaming replica
Audit Trail   | Write-ahead | Replicated

RPO / RTO Targets

Scenario                      | RPO             | RTO
------------------------------|-----------------|-------------
Auto-checkpoint on transition | ~0 seconds      | < 60 seconds
Manual checkpoint             | ≤ 5 minutes     | < 60 seconds
No persistence configured     | N/A (in-memory) | Restart only

Recovery Procedure

1. Verify system health from the CLI:

aegis health

2. Resume pending workflows programmatically (Python, inside an async context):

from workflows.persistence.durable import DurableWorkflowEngine

engine = DurableWorkflowEngine(session_factory)
await engine.resume_all_pending()

Persistence Layer

  • Production: PostgreSQL with streaming replication
  • Development: SQLite (in-memory or file-based)
  • Schema: src/workflows/persistence/models.py (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint)
  • Hash chain: SHA-256 chained audit trail for tamper detection
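
The hash-chain idea: each record stores a SHA-256 digest over its payload plus the previous record's digest, so altering any entry invalidates every later link. A self-contained sketch (the serialisation below is illustrative; the real persistence layer defines its own):

```python
import hashlib
import json

GENESIS = "0" * 64  # conventional starting link

def chain_hash(prev_hash: str, payload: dict) -> str:
    # Digest of the previous link plus a canonical serialisation of the payload.
    blob = prev_hash.encode() + json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def build_chain(records):
    prev, out = GENESIS, []
    for payload in records:
        prev = chain_hash(prev, payload)
        out.append({"payload": payload, "hash": prev})
    return out

def verify_chain(chain) -> bool:
    prev = GENESIS
    for entry in chain:
        if chain_hash(prev, entry["payload"]) != entry["hash"]:
            return False  # this entry (or an earlier one) was tampered with
        prev = entry["hash"]
    return True
```

Because each digest depends on all prior digests, verification of the final link transitively attests to the whole audit trail.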

10. Production Checklist

  • Configuration validated: aegis validate-config <config.yaml>
  • Metrics endpoint accessible: curl http://localhost:9090/metrics
  • Prometheus scraping confirmed: Check Prometheus targets page
  • Alert rules loaded: Verify in Prometheus UI (Status > Rules)
  • Grafana dashboards imported: Overview + Risk Analysis
  • RBAC roles configured: schema/rbac-definitions.yaml reviewed
  • Audit trail persistence configured: DATABASE_URL set (if using persistence)
  • Health check passing: aegis health or aegis version
  • HSM keys provisioned: (if PQ-hardened profile — see section 8)
  • DR failover tested: (if multi-region — see section 9)
  • Quality gates green: ruff check src/ && black --check src/ && mypy src/ && bandit -c pyproject.toml -r src/ && pytest tests/ -v
  • Telemetry URLs use HTTPS: All telemetry_url values use https:// (enforced by default)
  • No secrets in deployment config: Verify all credentials use environment variables or secret stores
