Compute & Orchestration¶
The compute layer is split into two planes: a data plane (ECS on EC2) that runs long-lived API servers and per-user worker processes, and a control plane (Lambda + SQS) that orchestrates worker lifecycle, background tasks, and maintenance. This separation means the orchestrator has no state to lose and the workers have no management overhead.
Operational guarantees
- Zero-downtime deployments via ECS rolling updates with circuit breaker (100% minimum healthy, 200% maximum); automatic rollback if new tasks fail health checks.
- No single point of failure in the control plane: Lambda functions are inherently HA with AWS-managed retries.
- The data plane recovers from any single-container failure within 60 seconds via the maintenance Lambda.
ECS Cluster¶
The cluster uses two capacity providers backed by separate Auto Scaling Groups, each optimized for its workload profile:
Why Two Capacity Providers
API tasks need CPU headroom for request processing (compute-bound). Worker tasks need memory for broker SDK sessions but barely touch the CPU during idle periods (memory-bound). Mixing them on the same instance type wastes resources in both directions.
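A minimal sketch of how the two providers could be created and attached with the ECS API (boto3). All names and ASG ARNs below are illustrative placeholders; only the two-provider, two-ASG layout comes from this document.

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names and ARNs; only the two-provider / two-ASG layout is from this doc.
providers = {
    "api-capacity-provider": "arn:aws:autoscaling:...:autoScalingGroup:...:api-asg",
    "worker-capacity-provider": "arn:aws:autoscaling:...:autoScalingGroup:...:worker-asg",
}

for name, asg_arn in providers.items():
    ecs.create_capacity_provider(
        name=name,
        autoScalingGroupProvider={
            "autoScalingGroupArn": asg_arn,
            "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
            "managedTerminationProtection": "ENABLED",
        },
    )

# Attach both providers to the cluster; each service then pins itself to one
# provider via its own capacityProviderStrategy.
ecs.put_cluster_capacity_providers(
    cluster="trading-cluster",
    capacityProviders=list(providers),
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "api-capacity-provider", "weight": 1},
    ],
)
```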
API Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | m7g.large (Graviton, 2 vCPU, 8 GB) |
| Tasks per Instance | 2 |
| CPU per Task | 1024 units (1 vCPU) |
| Memory per Task | 1536 MB soft / 3072 MB hard |
| Web Server | Gunicorn with 8 Uvicorn workers |
| Network Mode | bridge (dynamic host ports) |
| Deployment | Rolling update, circuit breaker enabled |
| Min Healthy % | 100% (zero-downtime deploys) |
| Max % | 200% (double capacity during deploy) |
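For illustration, a task-definition sketch that mirrors the table above. The image URI, application module (`app.main:app`), and container port are assumptions; the CPU, memory, networking, and Gunicorn/Uvicorn settings come from the table.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="api",
    networkMode="bridge",              # dynamic host ports behind the ALB
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "api",
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/api:latest",  # placeholder
        "cpu": 1024,                   # 1 vCPU
        "memoryReservation": 1536,     # soft limit (MB)
        "memory": 3072,                # hard limit (MB)
        "portMappings": [{"containerPort": 8000, "hostPort": 0}],  # hostPort 0 = dynamic
        "command": [
            "gunicorn", "app.main:app",              # app module is an assumption
            "-k", "uvicorn.workers.UvicornWorker",
            "--workers", "8",
            "--bind", "0.0.0.0:8000",
        ],
        "essential": True,
    }],
)
```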
Auto-Scaling Policies¶
The API service uses target-tracking scaling on two dimensions:
| Policy | Target | Scale-Out Cooldown | Scale-In Cooldown |
|---|---|---|---|
| CPU Utilization | 70% average | 60s | 300s |
| Request Count | 1000 req/min per target | 60s | 300s |
Scale-in is deliberately slow (300s cooldown) to avoid thrashing during intermittent traffic spikes around market open/close.
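A sketch of equivalent target-tracking policies via Application Auto Scaling (boto3). The cluster/service identifiers, min/max capacity, and ALB resource label are placeholders; the targets and cooldowns are taken from the table.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Cluster/service names, min/max capacity, and the ALB resource label are placeholders.
resource_id = "service/trading-cluster/api-service"

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

common = {
    "ServiceNamespace": "ecs",
    "ResourceId": resource_id,
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
}

aas.put_scaling_policy(
    PolicyName="api-cpu-target-tracking",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,                     # 70% average CPU
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
    **common,
)

aas.put_scaling_policy(
    PolicyName="api-request-count-target-tracking",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,                   # requests per target, per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
    **common,
)
```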
Rolling Deployment¶
Deployments use ECS rolling update with circuit breaker:
- New task definition registered
- ECS launches new tasks (up to 200% capacity)
- ALB health checks confirm new tasks are healthy
- Old tasks drain connections (300s deregistration delay)
- Old tasks stopped
If the new tasks fail health checks, the circuit breaker automatically rolls back to the previous task definition — no manual intervention required.
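The same behavior expressed as a service deployment configuration, shown here as a sketch (cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Cluster and service names are placeholders.
ecs.update_service(
    cluster="trading-cluster",
    service="api-service",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        "minimumHealthyPercent": 100,   # never drop below current capacity
        "maximumPercent": 200,          # allow a full parallel set of new tasks
    },
)
```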
Worker Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | r7g.xlarge (Graviton, 4 vCPU, 32 GB) |
| Tasks per Instance | 60 (conservative) |
| CPU per Task | 64 units |
| Memory Soft Limit | 384 MB |
| Memory Hard Limit | 1024 MB |
| Network Mode | bridge (shared host ENI) |
| Desired Count | Managed by Lambda orchestrator |
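A worker task-definition sketch mirroring the table above. The image URI is a placeholder; the absence of port mappings reflects the outbound-only design described under Bridge Networking below.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="worker",
    networkMode="bridge",               # shared host ENI allows 60+ tasks per instance
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "worker",
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/worker:latest",  # placeholder
        "cpu": 64,                      # workers are mostly idle; 64 units each
        "memoryReservation": 384,       # soft limit used for placement (MB)
        "memory": 1024,                 # hard limit enforced by the OOM killer (MB)
        "essential": True,
        # No portMappings: workers make outbound calls only; coordination goes through Redis.
    }],
)
```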
Capacity Math¶
Each r7g.xlarge provides 4096 CPU units and 32,768 MB RAM:
| Resource | Available | Per Task | Max Tasks | Limiting? |
|---|---|---|---|---|
| CPU | 4096 units | 64 units | 4096/64 = 64 | Yes (64 < 85) |
| Memory (soft) | 32,768 MB | 384 MB soft | 32768/384 = 85 | No |
| Memory (hard) | 32,768 MB | 1024 MB hard | 32768/1024 = 32 | Worst-case only |
The target of 60 tasks per instance is deliberately conservative; as the back-of-envelope sketch after this list shows, it leaves headroom for:
- OS and ECS agent overhead (~512 MB)
- Temporary memory spikes during broker API calls
- Container runtime overhead
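A back-of-envelope check of that headroom, using the figures from the tables above (512 MB is the OS/agent estimate quoted in the list):

```python
# Headroom check for 60 workers on one r7g.xlarge.
INSTANCE_CPU_UNITS = 4096
INSTANCE_MEMORY_MB = 32_768
OS_AND_AGENT_MB = 512          # estimated OS + ECS agent overhead

TASK_CPU_UNITS = 64
TASK_MEMORY_SOFT_MB = 384
TASKS = 60

cpu_used = TASKS * TASK_CPU_UNITS              # 3,840 of 4,096 units
mem_reserved = TASKS * TASK_MEMORY_SOFT_MB     # 23,040 MB reserved

print(f"CPU headroom: {INSTANCE_CPU_UNITS - cpu_used} units")                        # 256
print(f"Memory headroom: {INSTANCE_MEMORY_MB - OS_AND_AGENT_MB - mem_reserved} MB")  # 9,216
```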
OOM Protection
If a worker exceeds its 1024 MB hard limit, the Linux OOM killer terminates only that container. The EC2 instance and all other workers continue unaffected. The maintenance Lambda detects the missing worker within 60 seconds and the orchestrator restarts it automatically.
Bridge Networking¶
Workers use bridge mode instead of awsvpc:
| Feature | awsvpc | bridge |
|---|---|---|
| ENI per task | Yes (1 each) | No (shared) |
| Max tasks (m/r xlarge) | ~3 | 60+ |
| Per-task security group | Yes | No (host SG) |
| Port mapping | Static | Dynamic |
| Cost impact | ENI limits require more instances | High density, fewer instances |
The trade-off is acceptable because workers only need outbound internet to reach broker APIs. They don't receive inbound connections — all communication flows through Redis.
Lambda Orchestrator¶
Five Lambda functions form the serverless control plane:
| Function | Trigger | Timeout | Memory | Concurrency | Purpose |
|---|---|---|---|---|---|
| worker_control | SQS FIFO (worker-control) | 60s | 256 MB | 50–500 | Start, stop, claim workers. Pool assignment and RunTask fallback. |
| order_tasks | SQS FIFO (order-tasks) | 120s | 256 MB | 50–500 | Background fill verification. Query broker for order status after execution. |
| maintenance | EventBridge (every 60s) | 300s | 256 MB | 1 | Fan-out coordinator. Scans Redis for all worker marks, partitions work, invokes maintenance_worker in parallel. |
| maintenance_worker | Lambda invoke (from maintenance) | 30s | 256 MB | 100 | Process individual orphan detection batch. Check ECS task status, clean up stale marks, stop orphan tasks. |
| pool_manager | EventBridge (every 5 min) | 60s | 256 MB | — | Count pool workers, compare to target, launch or terminate to match desired pool size. |
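As an illustration of the fan-out step, a sketch of how maintenance might hand batches to maintenance_worker using asynchronous invokes. The payload shape and batch size are assumptions; only the function names and the parallel fan-out pattern come from the table above.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def fan_out(worker_marks: list, batch_size: int = 50) -> None:
    """Partition the worker marks scanned from Redis and hand each batch to
    maintenance_worker. Payload shape and batch size are assumptions."""
    for i in range(0, len(worker_marks), batch_size):
        lambda_client.invoke(
            FunctionName="maintenance_worker",
            InvocationType="Event",     # asynchronous: batches are processed in parallel
            Payload=json.dumps({"marks": worker_marks[i:i + batch_size]}),
        )
```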
Orchestrator Flow¶
FIFO Guarantees
The worker-control.fifo queue uses user_id as the message group ID. This ensures that multiple start/stop commands for the same user are processed in order, preventing races where a stop command is processed before the corresponding start has completed.
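A sketch of how a producer would enqueue a lifecycle command under that convention. The queue URL, payload shape, and deduplication scheme are assumptions; only the MessageGroupId = user_id rule comes from this document.

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")

def send_worker_command(queue_url: str, user_id: str, action: str) -> None:
    """Enqueue a lifecycle command; payload shape and dedup scheme are assumptions."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"action": action, "user_id": user_id}),
        MessageGroupId=user_id,   # all commands for one user share a group -> strict ordering
        MessageDeduplicationId=f"{user_id}:{action}:{uuid.uuid4()}",
    )
```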
SQS Queues¶
| Queue | Type | Visibility Timeout | Retention | DLQ | DLQ Max Receives | Purpose |
|---|---|---|---|---|---|---|
| worker-control.fifo | FIFO | 90s | 1 day | worker-control-dlq.fifo | 3 | Worker lifecycle commands (start, stop, claim). Message group: user_id. |
| order-tasks.fifo | FIFO | 180s | 1 day | order-tasks-dlq.fifo | 3 | Fill verification, delayed order checks. Message group: order_id. |
| pool-claim | Standard | 10s | 5 minutes | — | — | One-shot claim messages for pool workers. Short retention because unclaimed messages are stale. |
Dead Letter Queues¶
Both FIFO queues have DLQs that catch messages failing after 3 processing attempts. Both Lambda handlers use ReportBatchItemFailures so that only the specific failing record is retried — successfully processed records in the same batch are not re-delivered and do not have their receive count inflated.
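The handler shape this relies on is the SQS partial-batch-response contract. A minimal sketch, where process is a stand-in for the real worker_control / order_tasks logic:

```python
def process(record: dict) -> None:
    """Stand-in for the real worker_control / order_tasks message handling."""
    ...

def handler(event, context):
    """Lambda entry point using SQS partial batch responses."""
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # Only this record is retried and has its receive count incremented;
            # every other record in the batch is treated as successfully consumed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```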
| DLQ | Retention | CloudWatch Alarm | Dashboard |
|---|---|---|---|
| worker-control-dlq.fifo | 14 days | {env}-orchestrator-dlq-has-messages (> 0) | Yes |
| order-tasks-dlq.fifo | 14 days | {env}-order-tasks-dlq-has-messages (> 0) | Yes |
Both alarms send to the orchestrator-alerts SNS topic (email notification). The CloudWatch dashboard shows both DLQ message counts side by side.
DLQ Messages Are Genuine Failures
With ReportBatchItemFailures, only messages that truly failed 3 consecutive times reach the DLQ — no false positives from batch contamination. Common causes: Redis connectivity loss, ECS capacity exhausted, broker API persistently timing out. Action: check the corresponding Lambda error in CloudWatch Logs, fix the root cause, then redrive messages from the DLQ back to the main queue.
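Once the root cause is fixed, redrive can be done from the console or, as a minimal sketch, with the SQS message-move API (assuming DLQ redrive is available for the queue type in your account):

```python
import boto3

sqs = boto3.client("sqs")

# The ARN is a placeholder. With DestinationArn omitted, SQS moves the messages
# back to the queue they originally arrived on (the main FIFO queue).
sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:<region>:<account>:worker-control-dlq.fifo",
)
```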