Disaster Recovery Assessment

Version: 1.0.0 | Date: 2026-02-06 | Status: Phase 1 Complete


1. Current Architecture

AEGIS uses a checkpoint-based persistence model with the following components:

| Component | Implementation | Location |
|---|---|---|
| State Storage | SQLAlchemy async ORM | src/workflows/persistence/repository.py |
| Checkpoint Model | SHA-256 hashed snapshots | src/workflows/persistence/models.py |
| Audit Trail | Hash-chained transitions | WorkflowTransition.compute_hash() |
| Recovery Engine | DurableWorkflowEngine | src/workflows/persistence/durable.py |
| Health Check | aegis health CLI | src/cli.py |

Supported Databases

  • SQLite (development/testing): In-memory or file-based
  • PostgreSQL (production): Via asyncpg driver
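
Both databases are addressed through SQLAlchemy async connection URLs. A minimal sketch of the two URL forms (host, credentials, and file names here are illustrative, not the project's actual configuration):

```python
# SQLAlchemy async connection URLs for the two supported databases.
# Values are illustrative; substitute your own host, credentials, and paths.
SQLITE_MEMORY = "sqlite+aiosqlite:///:memory:"   # development: in-memory, lost on exit
SQLITE_FILE = "sqlite+aiosqlite:///./aegis.db"   # development: file-based
POSTGRES = "postgresql+asyncpg://aegis:secret@db-host:5432/aegis"  # production via asyncpg
```

Note that the in-memory SQLite URL provides no durability at all; it corresponds to the "no persistence configured" RPO scenario in the next section.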

2. Recovery Point Objective (RPO)

RPO = Time since last checkpoint

| Scenario | RPO | Notes |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | State saved on every transition |
| Manual checkpoint | Configurable | Recommend <= 5 minutes |
| No persistence configured | N/A (in-memory only) | Data lost on process termination |

Recommendation

Enable checkpoint_on_transition=True (the default) for a near-zero RPO on state transitions. For long-running workflows that spend significant time between transitions, add periodic checkpoints at 5-minute intervals or less.
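
The periodic-checkpoint recommendation can be sketched as a background task. This is a sketch only: engine.checkpoint() and workflow.is_complete are illustrative names standing in for whatever the engine actually exposes.

```python
import asyncio

async def periodic_checkpoint(engine, workflow, interval_seconds=300):
    # Checkpoint every interval_seconds (default 5 minutes) until the
    # workflow reaches a terminal state. engine.checkpoint() is an
    # illustrative name for the engine's checkpoint call.
    while not workflow.is_complete:
        await asyncio.sleep(interval_seconds)
        await engine.checkpoint(workflow)
```

In practice this would run alongside the workflow (e.g. via asyncio.create_task) and be cancelled once the workflow completes.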


3. Recovery Time Objective (RTO)

RTO = Process restart + resume_all_pending() execution time

| Component | Estimated Time | Notes |
|---|---|---|
| Process restart | < 5 seconds | Application startup |
| Database reconnection | < 2 seconds | Connection pool initialization |
| resume_all_pending() | < 30 seconds | For up to 100 workflows |
| Health verification | < 5 seconds | aegis health check |
| Total RTO | < 60 seconds | Single-instance recovery |

Verification

```shell
# Verify system health after recovery
aegis health
```

```python
# Resume pending workflows (programmatic)
engine = DurableWorkflowEngine(persistence)
workflows = await engine.resume_all_pending(ProposalWorkflow)
```

4. Integrity Guarantees

Hash Chain Verification

Every state transition is recorded with a SHA-256 hash chain:

```
Transition N: hash(workflow_id + from_state + to_state + actor + timestamp + previous_hash)
```

Tamper detection: If any transition record is modified, verify_audit_chain() will detect the broken chain and report the specific transition.
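
The chain computation and verification reduce to the following sketch (field names and record layout are illustrative; the real implementation lives in WorkflowTransition.compute_hash()):

```python
import hashlib

def compute_hash(workflow_id, from_state, to_state, actor, timestamp, previous_hash):
    # Fold the transition fields and the previous link into one SHA-256
    # digest, per the formula above.
    payload = f"{workflow_id}{from_state}{to_state}{actor}{timestamp}{previous_hash}"
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(transitions):
    # Recompute every link in order; report the index of the first broken one.
    previous_hash = ""
    for i, t in enumerate(transitions):
        expected = compute_hash(t["workflow_id"], t["from_state"], t["to_state"],
                                t["actor"], t["timestamp"], previous_hash)
        if t["hash"] != expected:
            return False, i
        previous_hash = expected
    return True, None
```

Because each hash folds in the previous one, modifying any record invalidates every link from that point forward, which is what lets verification pinpoint the specific transition.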

Checkpoint Integrity

Each checkpoint stores a SHA-256 hash of the serialized state snapshot. On restore, the hash can be recomputed and compared.
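
The restore-time check is a straightforward recompute-and-compare. A minimal sketch, assuming the snapshot is a JSON-serializable dict (the actual serialization format belongs to the checkpoint model):

```python
import hashlib
import json

def snapshot_hash(state):
    # sort_keys gives a deterministic serialization, so the same logical
    # state always produces the same hash.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def verify_snapshot(state, stored_hash):
    # On restore: recompute the hash of the deserialized state and
    # compare it with the hash stored alongside the checkpoint.
    return snapshot_hash(state) == stored_hash
```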

Verification Commands

```shell
# CLI health check (includes chain verification when persistence is available)
aegis health
```

```python
# Programmatic verification
valid, error = await engine.verify_integrity(workflow_id)
assert valid, f"Chain broken: {error}"
```

5. Tested Failure Scenarios

The following scenarios are covered by integration tests (tests/integration/test_dr_recovery.py):

| Test | Scenario | Verified |
|---|---|---|
| Checkpoint-Resume | Create, checkpoint, discard, resume | State matches |
| Post-Transition Resume | Transition, checkpoint, resume | Correct state restored |
| Hash Chain Valid | Multiple transitions | Chain verification passes |
| Batch Recovery | 5 workflows, resume all | All recovered |
| Completed Exclusion | Completed workflows | Not in pending list |
| Mixed States | Workflows at different stages | All resume correctly |
| Time-Travel | Load historical checkpoint | Correct historical state |
| Export/Import | to_dict() / from_dict() round-trip | State preserved |
| Audit Trail | Multiple transitions | All recorded with hashes |
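
The first scenario (Checkpoint-Resume) boils down to the following pattern, shown here against an illustrative in-memory store rather than the actual SQLAlchemy repository:

```python
import copy

class InMemoryStore:
    # Illustrative stand-in for the persistence repository.
    def __init__(self):
        self.checkpoints = {}

    def save(self, workflow_id, state):
        # deepcopy so later mutations of the live state cannot alter the checkpoint
        self.checkpoints[workflow_id] = copy.deepcopy(state)

    def load(self, workflow_id):
        return copy.deepcopy(self.checkpoints[workflow_id])

store = InMemoryStore()
state = {"stage": "review", "votes": 3}
store.save("wf-1", state)       # create + checkpoint
del state                       # discard the in-memory workflow
restored = store.load("wf-1")   # resume from the checkpoint
assert restored == {"stage": "review", "votes": 3}  # state matches
```

The integration tests exercise this same pattern end-to-end through the real engine and database.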

6. Current Limitations

| Limitation | Impact | Mitigation Path |
|---|---|---|
| Single-instance only | No automatic failover | Add async replication (Phase 2) |
| No real-time replication | Potential data loss if DB fails | Use PostgreSQL with streaming replication |
| No multi-region | Single-region dependency | Deploy read replicas in secondary region |
| No point-in-time recovery | Cannot restore to arbitrary timestamp | Use PostgreSQL PITR with WAL archiving |
| No automated backup | Manual backup required | Add scheduled pg_dump or WAL archiving |

7. Phase 2 Upgrade Path

When deployment infrastructure is available:

  1. PostgreSQL streaming replication for hot standby
  2. WAL archiving for point-in-time recovery
  3. Scheduled backups via pg_dump or cloud provider snapshots
  4. Multi-region read replicas for geographic redundancy
  5. Automated failover via Patroni or cloud provider HA
  6. Live DR drill simulating region outage with documented runbook