Production Deployment Guide
Version: 1.0.0 | Updated: 2026-02-09 | Status: Active
This guide covers deploying AEGIS in production environments including Docker, Kubernetes, and AWS.
1. Prerequisites
- Python: 3.9+ (3.11 recommended for production)
- pip: 21.0+
- OS: Linux (Ubuntu 22.04+, Amazon Linux 2023) or macOS
- Optional: Docker 24+, Kubernetes 1.28+, PostgreSQL 15+
Deployment Profiles
| Profile | Install Command | Use Case |
|---|---|---|
| Minimal | pip install -e "." | Evaluation only, zero dependencies |
| Standard | pip install -e ".[engine,telemetry,config]" | Production with metrics + scipy z-scores |
| Full | pip install -e ".[all]" | All features including crypto + persistence |
| PQ-Hardened | pip install -e ".[crypto,pqc,persistence]" | Post-quantum signatures + durable state |
2. Installation Profiles
Minimal (Zero Dependencies)
```
pip install -e "."
```
Provides: pcw_decide(), CLI (aegis evaluate), gate evaluation, Bayesian posterior. No scipy (z-scores unavailable), no Prometheus metrics, no YAML config loading.
Standard (Recommended for Production)
```
pip install -e ".[engine,telemetry,config]"
```
Adds:
- engine: scipy for utility z-score computation
- telemetry: prometheus_client for Prometheus metrics exporter
- config: pyyaml for YAML configuration loading
Full
```
pip install -e ".[all]"
```
Adds all optional groups: engine, telemetry, config, mcp, crypto, pqc, persistence.
Optional Dependency Groups
| Group | Package(s) | Purpose |
|---|---|---|
| engine | scipy | Utility z-score computation |
| telemetry | prometheus_client | Prometheus metrics exporter |
| config | pyyaml | YAML configuration loading |
| mcp | pyyaml | MCP server for AI agents |
| crypto | btclib, coincurve | BIP-340 Schnorr signatures |
| pqc | liboqs-python | ML-DSA-44, ML-KEM-768 (requires native liboqs) |
| persistence | sqlalchemy, asyncpg, aiosqlite | Durable workflow state |
3. Configuration
Default Configuration
```python
from aegis_governance import AegisConfig

# Uses frozen defaults matching schema/interface-contract.yaml
config = AegisConfig.default()
```
YAML Configuration
```python
# Requires: pip install -e ".[config]"
config = AegisConfig.from_yaml("config.yaml")
```
Example config.yaml:
```yaml
parameters:
  epsilon_R: 0.01
  epsilon_P: 0.01
  risk_trigger_factor: 2.0
  profit_trigger_factor: 2.0
  trigger_confidence_prob: 0.95
novelty_gate:
  N0: 0.7
  k: 10.0
output_threshold: 0.6
complexity_floor: 0.5
quality_min_score: 0.7
```
Dict Configuration
```python
config = AegisConfig.from_dict({"epsilon_R": 0.02, "quality_min_score": 0.8})
```
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| AEGIS_CONFIG_PATH | Path to YAML config file | None (uses defaults) |
| AEGIS_METRICS_PORT | Metrics server port | 9090 |
| AEGIS_LOG_LEVEL | Logging level | INFO |
| DATABASE_URL | PostgreSQL connection string | None (in-memory) |
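For illustration, a service might resolve these variables as sketched below. The variable names come from the table; the helper itself is an assumption, not part of the library:

```python
import os

# Hypothetical helper mirroring the table above: honour the environment
# variable if set, otherwise fall back to the documented default.
def metrics_port(default: int = 9090) -> int:
    return int(os.environ.get("AEGIS_METRICS_PORT", default))

os.environ["AEGIS_METRICS_PORT"] = "9200"
override = metrics_port()            # 9200: the environment wins
os.environ.pop("AEGIS_METRICS_PORT")
fallback = metrics_port()            # 9090: the documented default
```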
Frozen Parameter Policy
schema/interface-contract.yaml is the authoritative source for parameter values. AegisConfig defaults match this file exactly. Runtime mutation is impossible (frozen dataclass). Parameter changes require formal recalibration approval.
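The frozen-dataclass guarantee mentioned above can be sketched as follows. The real AegisConfig ships with the library; the stand-in class here only illustrates the mechanics:

```python
from dataclasses import dataclass, FrozenInstanceError

# Illustrative stand-in for the frozen parameter set (field names from the guide).
@dataclass(frozen=True)
class FrozenParams:
    epsilon_R: float = 0.01
    quality_min_score: float = 0.7

params = FrozenParams()
mutation_blocked = False
try:
    params.epsilon_R = 0.02      # any runtime mutation attempt...
except FrozenInstanceError:
    mutation_blocked = True      # ...raises before the value can change
```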
4. Docker Deployment
Dockerfile
The repository includes a multi-stage Dockerfile:
```
docker build -t aegis-governance .
```
Key features:
- Multi-stage build (smaller image)
- Non-root user (aegis, UID 1000)
- Health check via aegis version
- Exposes port 9090 for metrics
Docker Compose
Start AEGIS with Prometheus and Grafana:
```
docker compose up -d
```
Services:
| Service | Port | Purpose |
|---|---|---|
| aegis | 9090 | AEGIS metrics server |
| prometheus | 9091 | Prometheus monitoring |
| grafana | 3000 | Grafana dashboards |
Environment Variables (Docker)
| Variable | Default | Purpose |
|---|---|---|
| AEGIS_CONFIG_PATH | /app/schema/interface-contract.yaml | Config file path |
| AEGIS_METRICS_PORT | 9090 | Metrics endpoint port |
| DATABASE_URL | None | PostgreSQL connection string |
Volume Mounts
| Container Path | Purpose |
|---|---|
| /app/schema/ | Configuration schemas (read-only) |
| /app/config/ | Custom configuration (optional) |
5. Kubernetes Deployment
Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aegis-governance
  template:
    metadata:
      labels:
        app: aegis-governance
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: aegis
          image: aegis-governance:latest
          # TIP: For production, consider moving this to an entrypoint script
          # (e.g., docker-entrypoint.sh) for easier maintenance and log handling.
          command: ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"]
          ports:
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            exec:
              command: ["aegis", "version"]
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: AEGIS_CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: aegis-secrets
                  key: database-url
                  optional: true
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: aegis-config
```
ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-config
data:
  config.yaml: |
    parameters:
      epsilon_R: 0.01
      epsilon_P: 0.01
      risk_trigger_factor: 2.0
      profit_trigger_factor: 2.0
      trigger_confidence_prob: 0.95
```
Secret
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aegis-secrets
type: Opaque
stringData:
  database-url: "postgresql+asyncpg://{USER}:{PASSWORD}@{HOST}:5432/aegis"
  # HSM credentials (if PQ-hardened profile)
  hsm-pin: "{HSM_PIN}"
```
Service + ServiceMonitor
```yaml
apiVersion: v1
kind: Service
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  selector:
    app: aegis-governance
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aegis-governance
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aegis-governance
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
6. AWS Deployment
Lambda (Single Evaluation)
Use Lambda for on-demand proposal evaluation:
```python
import json
from aegis_governance import AegisConfig, PCWContext, PCWPhase, pcw_decide

config = AegisConfig.default()
evaluator = config.create_gate_evaluator()

def handler(event, context):
    ctx = PCWContext(
        agent_id=event.get("agent_id", "lambda"),
        session_id=context.aws_request_id,
        phase=PCWPhase.PLAN,
        proposal_summary=event["proposal_summary"],
        estimated_impact=event.get("estimated_impact", "medium"),
        risk_proposed=event.get("risk_score", 0.1),
        complexity_score=event.get("complexity_score", 0.5),
        quality_score=event.get("quality_score", 0.8),
    )
    decision = pcw_decide(ctx, gate_evaluator=evaluator)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": decision.status.value,
            "rationale": decision.rationale,
            "decision_id": decision.decision_id,
        }),
    }
```
Package with: pip install -e ".[engine]" -t ./package
ECS/Fargate (Metrics Server)
For long-running metrics collection:
```json
{
  "family": "aegis-governance",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "aegis",
      "image": "aegis-governance:latest",
      "command": ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"],
      "portMappings": [
        { "containerPort": 9090, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AEGIS_METRICS_PORT", "value": "9090" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "aegis version || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```
CDK Deployment
AEGIS infrastructure is defined as CDK stacks in infra/. The cdk.json context
uses empty placeholder values for account-specific settings; the actual AWS account
and region are injected via CI environment variables during deployment. This is
standard CDK practice — it avoids committing account identifiers to source control.
See .github/workflows/aegis-deploy.yml for the deployment pipeline, which uses
OIDC federation (no static credentials) and supports tag-triggered or manual
deployments.
DynamoDB (Audit Trail)
For serverless audit trail storage, implement a DynamoDB-backed repository following the WorkflowPersistence protocol in src/workflows/persistence/repository.py. The ORM models in src/workflows/persistence/models.py define the schema (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint).
Customer Management
The Lambda handler includes customer tracking (Phase 1 visibility). API key identity is extracted from API Gateway, mapped to customer records in DynamoDB, and usage is metered per evaluation. See the Customer Management Guide for setup, CLI commands, and troubleshooting.
7. Observability Setup
Metrics Endpoint
Start the built-in metrics server:
```python
# Programmatic (persistent server in background thread)
from telemetry.metrics_server import MetricsServer

server = MetricsServer(port=9090)
server.start()
```
To dump current metrics once (useful for debugging):
```
aegis metrics
```
Available Metrics
| Metric | Type | Description |
|---|---|---|
| aegis_decisions_total | Counter | Decision outcomes by gate type |
| aegis_gates_evaluated_total | Counter | Gate evaluations performed |
| aegis_latency_seconds | Histogram | Operation latency |
| aegis_decision_latency_seconds | Histogram | End-to-end decision latency |
| aegis_proposals_total | Counter | Proposals by state |
| aegis_active_proposals | Gauge | Currently active proposals |
| aegis_kl_divergence | Gauge | Current KL divergence |
| aegis_override_requests_total | Counter | Override requests by outcome |
| aegis_errors_total | Counter | Errors by component |
Prometheus Setup
- Copy recording rules: monitoring/prometheus/recording-rules.yaml
- Copy alerting rules: monitoring/prometheus/alerting-rules.yaml
- Configure scrape target: http://aegis:9090/metrics
Pre-computed recording rules:
- aegis:gate_pass_rate_5m — Gate pass rate over 5 minutes
- aegis:decision_rate_5m — Decision rate by status
- aegis:p99_latency_5m — p99 latency by operation
- aegis:override_rate_1h — Override request rate
- aegis:error_rate_5m — Error rate by component
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| AegisHighGateFailRate | Gate pass rate < 50% for 10m | Warning |
| AegisHighLatency | p99 latency > 1s for 5m | Warning |
| AegisOverrideSpike | Override rate > 0.1/min for 15m | Critical |
| AegisDriftCritical | KL divergence status = critical for 5m | Critical |
| AegisOverrideStalePartial | Partial override stuck > 2h | Warning |
| AegisErrorRate | Error rate > 5% for 5m | Warning |
Grafana Dashboards
Import from monitoring/grafana/:
- overview-dashboard.json — AEGIS system overview (decisions, gates, latency)
- risk-analysis-dashboard.json — Risk analytics (drift, Bayesian, overrides)
HTTP Telemetry Sink
Stream telemetry events to a remote collector via HTTP POST.
CLI:
```
echo '{"risk_proposed": 0.3, "profit_proposed": 0.1}' | \
  aegis evaluate --telemetry-url https://aegis-api-980022636831.us-central1.run.app/v1/events
```
YAML Configuration (aegis.yaml):
```yaml
telemetry_url: "https://aegis-api-980022636831.us-central1.run.app/v1/events"
```
Programmatic (per-event):
```python
from aegis_governance import HTTPEventSink, TelemetryEmitter, pcw_decide

emitter = TelemetryEmitter(source="my-service")
emitter.add_sink(HTTPEventSink(
    url="https://aegis-api-980022636831.us-central1.run.app/v1/events",
    headers={"Authorization": "Bearer ${TOKEN}"},
    timeout=10,
))
result = pcw_decide(context, telemetry_emitter=emitter)
```
Programmatic (batched — recommended for production):
```python
from aegis_governance import http_sink

sink = http_sink(
    "https://aegis-api-980022636831.us-central1.run.app/v1/events",
    batch_size=100,
    flush_interval_seconds=60,
)
sink.start()  # Background flush thread
emitter.add_sink(sink)
# ... on shutdown:
sink.stop()  # Drains remaining events
```
Production notes:
- Use HTTPS with authentication headers (Authorization: Bearer ... or X-API-Key: ...)
- BatchHTTPSink buffers events and flushes as JSON arrays (bounded: maxlen=batch_size*10)
- Failures are logged but never propagated — telemetry must not crash the producer
- Background flush thread is a daemon and sleeps in 1-second increments for responsive shutdown
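The bounded-buffer behaviour noted above can be seen with a plain deque; the batch size here is illustrative, not the library's internal value:

```python
from collections import deque

# With maxlen set, a deque drops the oldest entries instead of growing
# without bound -- the same back-pressure strategy described above.
batch_size = 3
buffer = deque(maxlen=batch_size * 10)
for i in range(50):
    buffer.append({"event": i})
# Only the newest 30 events remain; events 0-19 were silently dropped.
```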
TLS Requirement (CoSAI MCP-T7)
HTTPEventSink and BatchHTTPSink require https:// URLs by default. Governance telemetry contains decision rationale, gate results, and drift metrics — transmitting this over plaintext risks interception and tampering.
| Component | Enforcement | Escape Hatch |
|---|---|---|
| HTTPEventSink | Rejects http:// URLs | allow_insecure=True |
| BatchHTTPSink | Rejects http:// URLs | allow_insecure=True |
| http_sink() factory | Rejects http:// URLs | allow_insecure=True |
| MCP telemetry_url param | Rejects http:// scheme | None (network-facing) |
| CLI --telemetry-url | Warns and disables on http:// | None |
Local development (when you need http://):
```python
from telemetry.emitter import HTTPEventSink

# Explicitly opt into insecure transport for local development
sink = HTTPEventSink(
    url="http://localhost:9090/events",
    allow_insecure=True,  # Logs a WARNING for audit trail
)
```
MCP server: The MCP server (aegis-mcp-server) does not accept http:// telemetry URLs regardless of binding. This is defense-in-depth — a network-accessible endpoint should never relay telemetry over plaintext.
Reference: CoSAI MCP Security Taxonomy, MCP-T7 (Session and Transport Security Failures) recommends TLS 1.2+ for all MCP transport channels.
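The scheme check enforced by the sinks can be sketched as below. This helper is illustrative, not the library's actual implementation:

```python
from urllib.parse import urlparse

# Sketch of the TLS-enforcement policy above: accept https://, accept
# http:// only with an explicit opt-in, reject everything else.
def validate_telemetry_url(url: str, allow_insecure: bool = False) -> str:
    scheme = urlparse(url).scheme
    if scheme == "https" or (scheme == "http" and allow_insecure):
        return url
    raise ValueError(f"insecure telemetry URL rejected: {url}")
```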
8. HSM / KMS Integration
Status: Implemented — AWSKMSKEKProvider and HSMKEKProvider available in src/crypto/
8.1 Architecture Overview
AEGIS uses envelope encryption to protect HybridKEM private keys at rest. The HSM or KMS wraps/unwraps the private key blob; all HybridKEM cryptographic operations (X25519 + ML-KEM-768 + AES-256-GCM) execute in software because no HSM natively supports ML-KEM-768.
Startup:
```
HSM/KMS unwraps --> HybridKEM private key blob --> held in memory
```
Runtime:
```
encrypt: HybridKEMProvider.encrypt(plaintext, public_key) --> ciphertext
decrypt: HybridKEMProvider.decrypt(ciphertext, private_key) --> plaintext
```
The HSM/KMS protects the key at rest (storage, rotation, access control). The KEKProvider abstraction in src/crypto/kek_provider.py selects the appropriate backend via the get_kek_provider() factory.
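The wrap/unwrap round trip above can be made concrete with a toy sketch. The XOR "cipher" here is NOT secure and the key values are placeholders; it only illustrates the envelope pattern, while real deployments use the KMS/HSM primitives described in this section:

```python
import hashlib

# Toy keystream "cipher" for illustration only -- NOT secure.
def xor_stream(data: bytes, key: bytes) -> bytes:
    stream = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

kek = b"\x01" * 32                  # stand-in for the key held by the HSM/KMS
private_key_blob = b"\x02" * 64     # stand-in for the HybridKEM private key
wrapped = xor_stream(private_key_blob, kek)   # what gets stored at rest
unwrapped = xor_stream(wrapped, kek)          # what the process holds at startup
```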
8.2 Provider Selection
| Provider | Class | Install | Use Case |
|---|---|---|---|
| AWS KMS | AWSKMSKEKProvider | pip install -e ".[kms]" | AWS deployments (recommended) |
| PKCS#11 HSM | HSMKEKProvider | pip install -e ".[hsm]" | On-premises / CloudHSM |
| Environment | EnvironmentKEKProvider | (built-in) | Containerized deployments |
| In-Memory | InMemoryKEKProvider | (built-in) | Testing only |
The get_kek_provider("auto") factory tries providers in order: KMS, HSM, Environment, In-Memory.
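The fallback order can be sketched as follows; the strings stand in for the real provider classes and the availability check is an assumption about how detection works:

```python
# Illustrative sketch of the "auto" selection order described above.
PROVIDER_ORDER = ["kms", "hsm", "environment", "in-memory"]

def get_kek_provider(kind: str = "auto", available: tuple = ()) -> str:
    if kind != "auto":
        return kind
    for candidate in PROVIDER_ORDER:
        # In-memory is always constructible; the others need configuration.
        if candidate in available or candidate == "in-memory":
            return candidate
```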
8.3 AWS KMS Provider
Implementation: src/crypto/kms_kek_provider.py
Setup
1. Create a KMS symmetric key (AES-256):
```
aws kms create-key --key-spec SYMMETRIC_DEFAULT \
  --key-usage ENCRYPT_DECRYPT \
  --description "AEGIS KEK wrapping key"
```
2. Generate the HybridKEM keypair and wrap the private key:
```
python scripts/generate_master_kek.py
# Outputs: public key (base64), private key (base64)
# Wrap the private key with KMS:
aws kms encrypt \
  --key-id <KMS_KEY_ARN> \
  --plaintext fileb://private_key.bin \
  --encryption-context purpose=aegis-kek,version=1 \
  --output text --query CiphertextBlob > wrapped_private_key.b64
```
3. Set environment variables:
```
export AEGIS_KMS_KEY_ID="arn:aws:kms:us-west-2:123456789012:key/..."
export AEGIS_KMS_WRAPPED_PRIVATE_KEY="$(cat wrapped_private_key.b64)"
export AEGIS_MASTER_KEK_PUBLIC="$(cat public_key.b64)"
```
IAM Policy
The execution role needs kms:Decrypt permission with the encryption context condition:
```json
{
  "Effect": "Allow",
  "Action": "kms:Decrypt",
  "Resource": "<KMS_KEY_ARN>",
  "Condition": {
    "StringEquals": {
      "kms:EncryptionContext:purpose": "aegis-kek"
    }
  }
}
```
Key Rotation
AWS KMS supports automatic annual key rotation. The wrapped blob remains valid because KMS tracks key versions internally. To rotate the HybridKEM keypair itself:
- Generate new keypair
- Wrap new private key with KMS
- Re-encrypt all stored governance keys with the new KEK
- Update environment variables and bump version
8.4 HSM Provider (PKCS#11)
Implementation: src/crypto/hsm_kek_provider.py
Supported HSMs
| HSM | Interface | Notes |
|---|---|---|
| AWS CloudHSM | PKCS#11 | Production recommended |
| YubiHSM 2 | PKCS#11 | On-premises |
| SoftHSM2 | PKCS#11 | Development/testing only |
Setup
1. Install the PKCS#11 library for your HSM (e.g., /opt/cloudhsm/lib/libcloudhsm_pkcs11.so)
2. Create an AES-256 wrapping key in the HSM:
```
# Example for SoftHSM2:
softhsm2-util --init-token --slot 0 --label aegis --pin 1234 --so-pin 0000
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so \
  --login --pin 1234 --token-label aegis \
  --keygen --key-type AES:32 --label aegis-wrapping-key
```
3. Generate the HybridKEM keypair and wrap with the HSM key:
```
python scripts/generate_master_kek.py
# Then wrap using your HSM's key wrapping utility
# (CKM_AES_KEY_WRAP_KWP / RFC 5649)
```
4. Set environment variables:
```
export AEGIS_HSM_PKCS11_LIB="/opt/cloudhsm/lib/libcloudhsm_pkcs11.so"
export AEGIS_HSM_TOKEN_LABEL="aegis"
export AEGIS_HSM_PIN="1234"
export AEGIS_HSM_WRAPPING_KEY_LABEL="aegis-wrapping-key"
export AEGIS_HSM_WRAPPED_PRIVATE_KEY="$(base64 wrapped_private_key.bin)"
export AEGIS_MASTER_KEK_PUBLIC="$(cat public_key.b64)"
```
Session Pooling
HSMKEKProvider maintains a bounded session pool (collections.deque(maxlen=pool_size)) to amortise PKCS#11 session setup costs. Default pool size is 4. Sessions are borrowed for unwrap operations at startup and returned to the pool. Overflow sessions are closed rather than queued.
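The pool's overflow behaviour can be sketched with a small stand-in class; the Session type and method names here are illustrative, not the provider's actual API:

```python
from collections import deque

class Session:
    """Placeholder for a PKCS#11 session handle."""

# Sketch of the bounded pool described above: idle sessions are kept in a
# deque up to pool_size; overflow on release is discarded ("closed"),
# never queued.
class SessionPool:
    def __init__(self, pool_size: int = 4):
        self._pool = deque(maxlen=pool_size)

    def acquire(self) -> Session:
        return self._pool.pop() if self._pool else Session()

    def release(self, session: Session) -> None:
        if len(self._pool) < self._pool.maxlen:
            self._pool.append(session)
        # else: a real pool would close the overflow session here

pool = SessionPool(pool_size=2)
sessions = [pool.acquire() for _ in range(3)]
for s in sessions:
    pool.release(s)
```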
Key Wrapping Mechanism
The provider uses CKM_AES_KEY_WRAP_KWP (RFC 5649) which supports arbitrary-length payloads. The HybridKEM private key is 2,432 bytes (32 bytes X25519 + 2,400 bytes ML-KEM-768).
8.5 Thread Safety
Both providers use threading.Lock to protect encrypt/decrypt operations. The KMS provider synchronises HybridKEM operations. The HSM provider additionally protects the session pool with a separate lock. boto3 clients are thread-safe by design.
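The lock-guarded pattern can be sketched as below; the class and its decrypt body are placeholders, not the real provider:

```python
import threading

# Sketch of the serialisation described above: a single threading.Lock
# ensures only one crypto operation runs at a time.
class GuardedProvider:
    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def decrypt(self, blob: bytes) -> bytes:
        with self._lock:       # one operation at a time
            self.calls += 1
            return blob        # placeholder for the real unwrap

provider = GuardedProvider()
threads = [threading.Thread(target=provider.decrypt, args=(b"x",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```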
8.6 Configuration Reference
| Variable | Provider | Description |
|---|---|---|
| AEGIS_KMS_KEY_ID | KMS | KMS key ARN or alias |
| AEGIS_KMS_WRAPPED_PRIVATE_KEY | KMS | Base64-encoded KMS-encrypted private key |
| AEGIS_MASTER_KEK_PUBLIC | Both | Base64-encoded HybridKEM public key (1,216 bytes) |
| AEGIS_HSM_PKCS11_LIB | HSM | Path to PKCS#11 shared library |
| AEGIS_HSM_TOKEN_LABEL | HSM | HSM token/slot label |
| AEGIS_HSM_PIN | HSM | HSM user PIN |
| AEGIS_HSM_WRAPPING_KEY_LABEL | HSM | Label of AES wrapping key in HSM |
| AEGIS_HSM_WRAPPED_PRIVATE_KEY | HSM | Base64-encoded HSM-wrapped private key |
8.7 Migration from EnvironmentKEKProvider
To migrate from plaintext environment keys to KMS/HSM:
- Export current keys: read AEGIS_MASTER_KEK_PRIVATE and AEGIS_MASTER_KEK_PUBLIC
- Wrap the private key using KMS encrypt or HSM key wrapping
- Set the new environment variables (see sections 8.3 or 8.4)
- Remove AEGIS_MASTER_KEK_PRIVATE from the environment
- Change provider_type to "kms" or "hsm" (or use "auto" for detection)
- Verify with aegis health or by running a test encrypt/decrypt cycle
9. Multi-Region DR
AEGIS uses an active-passive strategy with audit trail replication.
Strategy
Active-passive with audit trail replication:
| Component | Primary | DR |
|---|---|---|
| AEGIS Process | Active | Standby |
| PostgreSQL | Primary | Streaming replica |
| Audit Trail | Write-ahead | Replicated |
RPO / RTO Targets
| Scenario | RPO | RTO |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | < 60 seconds |
| Manual checkpoint | ≤ 5 minutes | < 60 seconds |
| No persistence configured | N/A (in-memory) | Restart only |
Recovery Procedure
```
# 1. Verify system health
aegis health
```
```python
# 2. Resume pending workflows (programmatic)
from workflows.persistence.durable import DurableWorkflowEngine

engine = DurableWorkflowEngine(session_factory)
await engine.resume_all_pending()
```
Persistence Layer
- Production: PostgreSQL with streaming replication
- Development: SQLite (in-memory or file-based)
- Schema: src/workflows/persistence/models.py (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint)
- Hash chain: SHA-256 chained audit trail for tamper detection
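The chained audit trail works by having each entry commit to the previous entry's hash, so editing any record changes every hash after it. A sketch with an illustrative record shape:

```python
import hashlib
import json

# Sketch of a SHA-256 hash chain: each digest covers the previous digest
# plus the canonicalised record, so tampering anywhere breaks the chain head.
def build_chain(records):
    prev, chain = "0" * 64, []
    for rec in records:
        payload = prev + json.dumps(rec, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        chain.append({"record": rec, "hash": digest})
        prev = digest
    return chain

trail = build_chain([{"event": "PLAN"}, {"event": "COMMIT"}])
tampered = build_chain([{"event": "PLAN*"}, {"event": "COMMIT"}])
```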
10. Production Checklist
- Configuration validated: aegis validate-config <config.yaml>
- Metrics endpoint accessible: curl http://localhost:9090/metrics
- Prometheus scraping confirmed: Check Prometheus targets page
- Alert rules loaded: Verify in Prometheus UI (Status > Rules)
- Grafana dashboards imported: Overview + Risk Analysis
- RBAC roles configured: schema/rbac-definitions.yaml reviewed
- Audit trail persistence configured: DATABASE_URL set (if using persistence)
- Health check passing: aegis health or aegis version
- HSM keys provisioned: (if PQ-hardened profile — see section 8)
- DR failover tested: (if multi-region — see section 9)
- Quality gates green: ruff check src/ && black --check src/ && mypy src/ && bandit -c pyproject.toml -r src/ && pytest tests/ -v
- Telemetry URLs use HTTPS: All telemetry_url values use https:// (enforced by default)
- No secrets in deployment config: Verify all credentials use environment variables or secret stores
References
- README.md — Quick start and installation
- monitoring/README.md — Metrics endpoint setup
- Interface Contract — Frozen parameters
- Performance SLAs — Latency and throughput targets
- Migration Guide — Upgrade procedures