
Demo: Crash Recovery

This demo shows how IntentusNet handles crashes during multi-step execution, enabling safe recovery and resume.

Scenario

A data processing pipeline is executing across three agents. A power failure occurs mid-execution. The system must recover and resume without duplicating work or losing state.

Input Intent

{
  "intent": {
    "name": "ProcessPipelineIntent",
    "version": "1.0"
  },
  "payload": {
    "pipeline": "daily-etl",
    "steps": ["extract", "transform", "load"]
  },
  "routing": {
    "strategy": "BROADCAST"
  }
}

Initial Execution

Timeline:
─────────────────────────────────────────────────────────────

t0: Intent received
└─ Event: INTENT_RECEIVED recorded

t1: Extract agent starts
└─ Event: AGENT_ATTEMPT_START (extract-agent) recorded

t2: Extract agent completes
└─ Event: AGENT_ATTEMPT_END (extract-agent, success) recorded
└─ Data: 1M records extracted

t3: Transform agent starts
└─ Event: AGENT_ATTEMPT_START (transform-agent) recorded

t4: ⚡ POWER FAILURE ⚡
└─ System crashes mid-transform
└─ No AGENT_ATTEMPT_END recorded

─────────────────────────────────────────────────────────────
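
The timeline depends on each event being durably recorded before the next step begins, so a crash leaves a START with no matching END. A minimal sketch of that pattern follows, assuming a hypothetical `EventLog` class that appends JSON lines to a file; it illustrates the idea and is not IntentusNet's actual store.

```python
import json
import os
import tempfile


class EventLog:
    """Hypothetical append-only event log (illustration only, not the IntentusNet API)."""

    def __init__(self, path):
        self.path = path
        self.seq = 0

    def record(self, event_type, **payload):
        # Append one JSON line and force it to disk before returning,
        # so the event survives a crash that happens right afterwards.
        self.seq += 1
        entry = {"seq": self.seq, "type": event_type, "payload": payload}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def read(self):
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def run_agent(log, name, work):
    # START is written before the work begins and END only after it returns,
    # so a crash in between leaves a START with no matching END.
    log.record("AGENT_ATTEMPT_START", agent=name)
    work()
    log.record("AGENT_ATTEMPT_END", agent=name, status="success")


def simulated_power_failure():
    raise RuntimeError("power failure")


log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
log = EventLog(log_path)
log.record("INTENT_RECEIVED")
run_agent(log, "extract-agent", lambda: None)
try:
    run_agent(log, "transform-agent", simulated_power_failure)
except RuntimeError:
    pass  # stands in for the process dying at t4

event_types = [e["type"] for e in log.read()]
```

Reading the log back after the simulated failure yields the same four event types the crash record contains, ending on an unmatched AGENT_ATTEMPT_START.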

Execution Record at Crash

{
  "header": {
    "executionId": "exec-pipeline-2024-001",
    "createdUtcIso": "2024-01-15T10:30:00Z",
    "envelopeHash": "sha256:a1b2c3d4...",
    "replayable": false,
    "replayableReason": "execution_incomplete"
  },
  "events": [
    {"seq": 1, "type": "INTENT_RECEIVED", "payload": {...}},
    {"seq": 2, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "extract-agent"}},
    {"seq": 3, "type": "AGENT_ATTEMPT_END", "payload": {"agent": "extract-agent", "status": "success"}},
    {"seq": 4, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "transform-agent"}}
  ],
  "finalResponse": null
}
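
The header flags in this record can be derived from the record itself. The sketch below assumes a record counts as complete only when its finalResponse is non-null; that rule and the `replayable_status` helper are illustrative assumptions, not documented IntentusNet behavior.

```python
def replayable_status(record):
    """Return (replayable, reason) for a record dict shaped like the one above.

    Assumption for this sketch: a record is complete only when its
    finalResponse is non-null.
    """
    if record.get("finalResponse") is None:
        return False, "execution_incomplete"
    return True, None


crashed_record = {
    "header": {"executionId": "exec-pipeline-2024-001"},
    "events": [
        {"seq": 1, "type": "INTENT_RECEIVED", "payload": {}},
        {"seq": 2, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "extract-agent"}},
        {"seq": 3, "type": "AGENT_ATTEMPT_END", "payload": {"agent": "extract-agent", "status": "success"}},
        {"seq": 4, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "transform-agent"}},
    ],
    "finalResponse": None,
}

# Fill in the header fields from the derived status.
replayable, reason = replayable_status(crashed_record)
crashed_record["header"]["replayable"] = replayable
crashed_record["header"]["replayableReason"] = reason
```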

Recovery Analysis

$ intentusnet inspect exec-pipeline-2024-001

{
  "execution_id": "exec-pipeline-2024-001",
  "status": "incomplete",
  "replayable": false,
  "replayable_reason": "execution_incomplete",
  "crash_analysis": {
    "last_event": {
      "seq": 4,
      "type": "AGENT_ATTEMPT_START",
      "agent": "transform-agent"
    },
    "completed_agents": ["extract-agent"],
    "in_progress_agent": "transform-agent",
    "not_started_agents": ["load-agent"],
    "recommendation": "Check transform-agent idempotency before retry"
  }
}
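
The crash_analysis fields can be reconstructed from the event list plus the pipeline's planned agent order. The following is a rough sketch of that derivation; `analyze_crash` and its `planned_agents` parameter are hypothetical names, and the planned order is passed in because the event log alone cannot name agents that never started.

```python
def analyze_crash(events, planned_agents):
    """Build a crash_analysis-style summary from recorded events (illustrative)."""
    started, ended, completed = [], [], []
    for event in events:
        agent = event["payload"].get("agent")
        if event["type"] == "AGENT_ATTEMPT_START":
            started.append(agent)
        elif event["type"] == "AGENT_ATTEMPT_END":
            ended.append(agent)
            if event["payload"].get("status") == "success":
                completed.append(agent)

    # An agent with a START but no END was running when the crash hit.
    in_progress = [a for a in started if a not in ended]
    return {
        "last_event": events[-1] if events else None,
        "completed_agents": completed,
        "in_progress_agent": in_progress[-1] if in_progress else None,
        "not_started_agents": [a for a in planned_agents if a not in started],
    }


events = [
    {"seq": 1, "type": "INTENT_RECEIVED", "payload": {}},
    {"seq": 2, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "extract-agent"}},
    {"seq": 3, "type": "AGENT_ATTEMPT_END", "payload": {"agent": "extract-agent", "status": "success"}},
    {"seq": 4, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "transform-agent"}},
]
analysis = analyze_crash(events, ["extract-agent", "transform-agent", "load-agent"])
```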

Recovery Decision

┌─────────────────────────────────────────────────────────────┐
│ Recovery Analysis                                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ Agent            │ Status      │ Action Required            │
│ ─────────────────┼─────────────┼─────────────────────────── │
│ extract-agent    │ ✓ Completed │ Skip (already done)        │
│ transform-agent  │ ? Unknown   │ Check state, decide        │
│ load-agent       │ ○ Not run   │ Will run after transform   │
│                                                             │
│ Options:                                                    │
│ 1. Resume from transform-agent (if idempotent)              │
│ 2. Rollback extract, restart entirely                       │
│ 3. Manual inspection of transform state                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
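
The three options can be encoded as a small decision function. This sketch assumes the operator maintains a set of agents known to be idempotent and knows whether rollback is available; the `RecoveryAction` enum, the `choose_recovery` function, and the preference for rollback over manual inspection are all illustrative choices, not part of IntentusNet.

```python
from enum import Enum


class RecoveryAction(Enum):
    RESUME = "resume from crashed agent"
    RESTART = "rollback and restart"
    INSPECT = "manual inspection"


def choose_recovery(crashed_agent, idempotent_agents, rollback_supported):
    # Option 1: retrying an idempotent agent cannot duplicate its effects.
    if crashed_agent in idempotent_agents:
        return RecoveryAction.RESUME
    # Option 2: otherwise undo completed work and start over, if possible.
    if rollback_supported:
        return RecoveryAction.RESTART
    # Option 3: fall back to a human looking at the partial state.
    return RecoveryAction.INSPECT


# Hypothetical registry of agents that are safe to re-run.
IDEMPOTENT_AGENTS = {"extract-agent", "transform-agent"}
action = choose_recovery("transform-agent", IDEMPOTENT_AGENTS, rollback_supported=True)
```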

Resume Execution

If transform-agent is idempotent:

# Load crash state
store = FileExecutionStore(".intentusnet/records")
crashed_record = store.load("exec-pipeline-2024-001")

# Identify completed work
completed = [
    e for e in crashed_record.events
    if e.type == "AGENT_ATTEMPT_END" and e.payload.get("status") == "success"
]

# Resume from crashed point
resume_from = ["transform-agent", "load-agent"]

# Create resume intent
resume_envelope = IntentEnvelope(
    intent=IntentRef(name="ProcessPipelineIntent"),
    payload={
        "pipeline": "daily-etl",
        "steps": ["transform", "load"],  # Skip completed
        "resume_from": "exec-pipeline-2024-001"
    },
    # ...
)

response = runtime.router.route_intent(resume_envelope)
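
The resume snippet hardcodes the remaining steps. They could instead be derived from the completed agents, as in the sketch below. It assumes each step maps to an agent named with an "-agent" suffix, as in this demo, and that steps run strictly in order; `remaining_steps` is a hypothetical helper.

```python
def remaining_steps(all_steps, completed_agents, suffix="-agent"):
    """Return the steps still to run, starting at the first incomplete one."""
    done = {
        name[: -len(suffix)] if name.endswith(suffix) else name
        for name in completed_agents
    }
    for index, step in enumerate(all_steps):
        if step not in done:
            # Later steps depend on this one, so everything from here on reruns.
            return all_steps[index:]
    return []


steps = remaining_steps(["extract", "transform", "load"], ["extract-agent"])
```

Here `steps` is the list to place in the resume payload's "steps" field.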

Resume Execution Trace

{
  "execution_id": "exec-pipeline-2024-002",
  "metadata": {
    "resume_from": "exec-pipeline-2024-001",
    "skipped_agents": ["extract-agent"]
  },
  "events": [
    {"seq": 1, "type": "INTENT_RECEIVED", "payload": {"resumed": true}},
    {"seq": 2, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "transform-agent"}},
    {"seq": 3, "type": "AGENT_ATTEMPT_END", "payload": {"agent": "transform-agent", "status": "success"}},
    {"seq": 4, "type": "AGENT_ATTEMPT_START", "payload": {"agent": "load-agent"}},
    {"seq": 5, "type": "AGENT_ATTEMPT_END", "payload": {"agent": "load-agent", "status": "success"}},
    {"seq": 6, "type": "FINAL_RESPONSE", "payload": {"status": "success"}}
  ]
}

Final State

Pipeline Execution Summary:
───────────────────────────────────────────────────────────────

Original Execution: exec-pipeline-2024-001
├─ extract-agent: ✓ Completed
├─ transform-agent: ✗ Crashed mid-execution
└─ load-agent: ○ Not started

Resume Execution: exec-pipeline-2024-002
├─ extract-agent: ⊘ Skipped (completed in original)
├─ transform-agent: ✓ Completed (retry succeeded)
└─ load-agent: ✓ Completed

Combined Result:
All 3 pipeline steps completed successfully.
No duplicate processing of extract step.
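
The combined result can be checked mechanically by merging the successful agents from both event logs. This is a sketch under the assumption that events are available as plain dicts; `successful_agents` and `combine_executions` are hypothetical helpers.

```python
def successful_agents(events):
    """Agents with a successful AGENT_ATTEMPT_END in one execution."""
    return [
        e["payload"]["agent"]
        for e in events
        if e["type"] == "AGENT_ATTEMPT_END" and e["payload"].get("status") == "success"
    ]


def combine_executions(original_events, resume_events, planned_agents):
    first = successful_agents(original_events)
    second = successful_agents(resume_events)
    covered = set(first) | set(second)
    return {
        # Agents that succeeded in both runs indicate duplicated work.
        "duplicated": sorted(set(first) & set(second)),
        "missing": [a for a in planned_agents if a not in covered],
        "all_completed": all(a in covered for a in planned_agents),
    }


def end(agent):
    return {"type": "AGENT_ATTEMPT_END", "payload": {"agent": agent, "status": "success"}}


def start(agent):
    return {"type": "AGENT_ATTEMPT_START", "payload": {"agent": agent}}


original = [start("extract-agent"), end("extract-agent"), start("transform-agent")]
resumed = [start("transform-agent"), end("transform-agent"), start("load-agent"), end("load-agent")]
summary = combine_executions(
    original, resumed, ["extract-agent", "transform-agent", "load-agent"]
)
```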

Replay Note

After a successful resume, the resume execution replays normally. The original record remains marked replayable: false, so replaying it requires --force:

# Original (incomplete)
$ intentusnet replay exec-pipeline-2024-001 --force
{
  "warning": "Forced replay of incomplete record",
  "last_complete_step": "extract-agent",
  "incomplete": true
}

# Resume (complete)
$ intentusnet replay exec-pipeline-2024-002
{
  "from_replay": true,
  "payload": {
    "pipeline": "daily-etl",
    "status": "completed",
    "steps_completed": ["transform", "load"]
  }
}

Key Points

Aspect             │ Behavior
───────────────────┼──────────────────────────────────
Crash detection    │ Record marked incomplete
State preservation │ Events up to crash saved
Recovery analysis  │ Identifies crash point
Resume             │ Skip completed, retry from crash
Audit trail        │ Both executions recorded

Code Example

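from intentusnet import IntentusRuntime, FileExecutionStore


def recover_from_crash(execution_id: str):
    store = FileExecutionStore(".intentusnet/records")
    record = store.load(execution_id)

    # Analyze crash point by pairing each START with its END
    completed_agents = []
    in_flight = []  # agents with a START but no END yet

    for event in record.events:
        agent = event.payload.get("agent")
        if event.type == "AGENT_ATTEMPT_START":
            in_flight.append(agent)
        elif event.type == "AGENT_ATTEMPT_END":
            if agent in in_flight:
                in_flight.remove(agent)
            if event.payload.get("status") == "success":
                completed_agents.append(agent)

    # A START with no corresponding END is where we crashed
    crashed_agent = in_flight[-1] if in_flight else None

    print(f"Completed: {completed_agents}")
    print(f"Crashed during: {crashed_agent}")

    if crashed_agent is None:
        return None  # no attempt was in flight, so there is nothing to retry here

    # Decide: retry or rollback.
    # is_idempotent, resume_from, and rollback_and_retry are
    # application-specific helpers that you supply.
    if is_idempotent(crashed_agent):
        return resume_from(crashed_agent)
    else:
        return rollback_and_retry()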
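
Whether the crashed agent is safe to retry depends on how it writes its output. One common way to make a step idempotent is to write to a deterministic location keyed by the run and publish the result atomically, so a retry either finds the finished output or starts clean. The sketch below illustrates that idea with a hypothetical `transform_idempotent` function and a toy uppercase transform; it is not how IntentusNet agents are implemented.

```python
import json
import os
import tempfile


def transform_idempotent(records, out_dir, run_key):
    """Transform records once per run_key; repeated calls do no extra work.

    Returns (output_path, did_work).
    """
    final_path = os.path.join(out_dir, f"{run_key}.transformed.json")
    if os.path.exists(final_path):
        # A previous attempt finished and published its output.
        return final_path, False

    transformed = [r.upper() for r in records]  # toy stand-in for the real transform

    # Write to a temp file first, then rename. os.replace is atomic on the
    # same filesystem, so a crash never leaves a half-written final file.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(transformed, f)
    os.replace(tmp_path, final_path)
    return final_path, True


out_dir = tempfile.mkdtemp()
first_path, first_did_work = transform_idempotent(["a", "b"], out_dir, "daily-etl-run")
retry_path, retry_did_work = transform_idempotent(["a", "b"], out_dir, "daily-etl-run")
with open(retry_path, encoding="utf-8") as f:
    published = json.load(f)
```

With a step written this way, retrying after a crash is safe regardless of whether the first attempt had finished.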
