Worker Isolation System¶
The defining architectural decision of 4pass is per-user worker isolation: every active user gets a dedicated ECS container running their broker session. This is not a convenience — it is a requirement driven by four hard constraints.
Business takeaway
Per-user isolation is non-negotiable for broker API session binding and fault containment. Pool pre-warming (897ms claim latency) makes this model economically viable at scale — without it, cold-start latency of 45-60 seconds would be unacceptable for a trading platform. The result: users get dedicated, fault-isolated broker sessions with sub-second readiness at a variable cost of $1.61/month per user.
Why Per-User Isolation¶
| Constraint | Explanation |
|---|---|
| Broker API Binding | Some broker SDKs (notably Shioaji) bind sessions to the originating IP or process. Sharing a process across users would break session affinity. |
| Credential Security | Each worker loads exactly one user's decrypted broker credentials. No worker ever has access to another user's secrets, even in the event of a memory dump or core dump. |
| Fault Isolation | A broker SDK crash, OOM, or hung connection kills only that user's worker. All other users are unaffected. The orchestrator restarts the failed worker within 60 seconds. |
| Resource Isolation | Memory-hungry broker sessions are capped at 1024 MB hard limit. A runaway process triggers an OOM kill on the container only — the EC2 host and its other 59 workers continue uninterrupted. |
Non-Negotiable
Shared-worker architectures fail here. A single Shioaji session consuming 800 MB would starve other users. A broker API timeout in one session would block the event loop for everyone. Per-user isolation is the only viable model.
Worker Lifecycle¶
Lifecycle Timings¶
| State Transition | Duration | Mechanism |
|---|---|---|
| Pool idle → Claimed | ~332ms | SQS pool-claim message received |
| Claimed → Active | 1–5s | Credential decryption + broker handshake |
| Active → Processing | <10ms | Redis BLPOP on request queue |
| Processing → Active | 50–500ms | Broker API call + response write |
| Active → Idle timeout | 30 min | No orders or control messages |
| Health check interval | 60s | Broker session ping |
| Heartbeat interval | 5s | Redis SET with 30s TTL |
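The state transitions above can be sketched as a small transition table. This is an illustrative model only — the state names and the `transition` helper are hypothetical, not identifiers from the 4pass codebase:

```python
from enum import Enum

class WorkerState(Enum):
    POOL_IDLE = "pool_idle"      # pre-warmed, USER_ID=-1
    CLAIMED = "claimed"          # claim message received
    ACTIVE = "active"            # broker session established
    PROCESSING = "processing"    # handling an order request
    STOPPED = "stopped"          # idle timeout / shutdown

# Legal moves, taken from the lifecycle timings table above.
TRANSITIONS = {
    WorkerState.POOL_IDLE: {WorkerState.CLAIMED},                        # SQS pool-claim message
    WorkerState.CLAIMED: {WorkerState.ACTIVE},                           # decrypt creds + broker handshake
    WorkerState.ACTIVE: {WorkerState.PROCESSING, WorkerState.STOPPED},   # BLPOP hit / 30 min idle
    WorkerState.PROCESSING: {WorkerState.ACTIVE},                        # broker call + response write done
}

def transition(current: WorkerState, new: WorkerState) -> WorkerState:
    """Return the new state, or raise if the table forbids the move."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Encoding the transitions as data makes illegal moves (e.g. claiming an already-claimed worker) fail loudly instead of silently corrupting state.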
Pool Pre-Warming¶
The pool is the key to sub-second worker readiness. Pre-warmed workers sit in the ECS cluster with USER_ID=-1, running a stripped-down event loop that does nothing but wait for a claim message on the SQS pool-claim queue.
How It Works¶
- `pool_manager` Lambda runs every 5 minutes via EventBridge
- Counts current pool workers in Redis (key pattern `worker:active:-1:*` or metadata scan)
- Compares the count to the target pool size (a Terraform variable)
- Launches or terminates workers to match the target
Claim Flow¶
- API determines the user needs a worker → sends to `worker-control.fifo`
- `worker_control` Lambda checks Redis for pool availability
- Lambda sends a claim message to the `pool-claim` queue: `{ user_id, encrypted_credentials }`
- Pool worker's `BLPOP` picks up the message
- Worker transitions from `USER_ID=-1` to `USER_ID={claimed_user}`
- Worker decrypts credentials, establishes broker connection
- Worker sets `worker:active:{user_id}` in Redis with 30s TTL
- Worker begins heartbeat loop (5s interval)
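The worker-side half of the claim flow can be sketched as follows. The `PoolWorker` class and `handle_claim` method are hypothetical names, and credential decryption plus the broker handshake are stubbed out so the transition logic runs standalone:

```python
POOL_USER_ID = -1  # sentinel user id for pre-warmed pool workers

class PoolWorker:
    """Minimal sketch of the claim transition (illustrative, not the
    actual 4pass worker class)."""

    def __init__(self):
        self.user_id = POOL_USER_ID  # pool workers start as USER_ID=-1

    def handle_claim(self, message: dict, decrypt=lambda blob: blob) -> int:
        if self.user_id != POOL_USER_ID:
            raise RuntimeError("worker already claimed")
        credentials = decrypt(message["encrypted_credentials"])
        # ...establish broker connection with `credentials`...
        self.user_id = message["user_id"]  # USER_ID=-1 -> USER_ID={claimed_user}
        # ...set worker:active:{user_id} in Redis with 30s TTL,
        #    then start the 5s heartbeat loop...
        return self.user_id
```

Rejecting a second claim guards against the duplicate-worker race described in the anomaly table later in this page.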
Benchmark Comparison¶
| Path | Step | Latency |
|---|---|---|
| Pool Claim | SQS → Lambda invocation | 565ms |
| | Lambda → pool claim → worker ready | 332ms |
| | **Total** | **897ms** |
| RunTask (cold) | SQS → Lambda invocation | 565ms |
| | Lambda → ECS RunTask → task running | 3,103ms |
| | **Total** | **3,659ms** |
| Cold EC2 Start | No instance available → ASG launch | 45,000–60,000ms |
| | + ECS task scheduling | +3,000ms |
| | **Total** | **48,000–63,000ms** |
4× Faster with Pool
Pool pre-warming reduces user-perceived latency from 3.6 seconds to under 1 second. For a trading platform where seconds matter, this is the difference between executing at the intended price and missing the move.
Redis Queue Communication¶
All communication between the API and workers flows through Redis lists using a request/response pattern. There is no direct network connection between the API process and any worker.
Key Patterns¶
| Key | Type | TTL | Operations |
|---|---|---|---|
| `trading:user:{user_id}:requests` | List | None | API: LPUSH / Worker: BLPOP (blocking, 30s timeout) |
| `trading:response:{request_id}` | String | 60s | Worker: SET / API: GET with polling |
| `worker:active:{user_id}` | String | 30s | Worker: SET every 5s / API & Lambda: GET |
| `worker:metadata:{user_id}` | Hash | 30s | Worker: HSET (task_arn, instance_id, launched_at) |
| `worker:control:{user_id}:messages` | List | None | API: LPUSH / Worker: LPOP (non-blocking check) |
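The request/response round trip can be simulated end to end. The sketch below substitutes an in-memory `MiniRedis` (a dict plus `deque`) for redis-py so it runs without a Redis server; key names follow the table above, while the payload fields (`request_id`, `action`) are illustrative:

```python
import json
from collections import deque

class MiniRedis:
    """Tiny in-memory stand-in for the Redis operations the pattern uses."""
    def __init__(self):
        self.lists = {}
        self.strings = {}
    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)
    def blpop(self, key):
        q = self.lists.get(key)
        return q.popleft() if q else None  # real BLPOP blocks up to 30s
    def set(self, key, value, ex=None):
        self.strings[key] = value          # `ex` TTL ignored in this sketch
    def get(self, key):
        return self.strings.get(key)

r = MiniRedis()

# API side: enqueue a request under the user's request list.
request = {"request_id": "req-1", "action": "place_order"}
r.lpush("trading:user:42:requests", json.dumps(request))

# Worker side: pop the request, perform the broker call, write the response.
msg = json.loads(r.blpop("trading:user:42:requests"))
r.set(f"trading:response:{msg['request_id']}",
      json.dumps({"status": "filled"}), ex=60)

# API side: poll the response key by request_id.
response = json.loads(r.get("trading:response:req-1"))
```

Because the API only ever polls `trading:response:{request_id}`, the worker and API never need a direct network connection, exactly as the section describes.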
Heartbeat Mechanism¶
The heartbeat is the foundation of the worker liveness protocol:
- Worker calls `SET worker:active:{user_id} {timestamp} EX 30` every 5 seconds
- If the worker dies, the key expires in 30 seconds (6 missed heartbeats)
- API checks the key before routing orders — if absent, triggers worker start
- Maintenance Lambda scans all `worker:active:*` keys to detect orphans
The 30s TTL with 5s refresh gives a 25-second buffer — enough to survive brief Redis connectivity blips without falsely declaring a worker dead.
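The liveness arithmetic is simple enough to state as code. A minimal sketch — the helper names are illustrative, and in production the check is implicit in whether the Redis key still exists:

```python
HEARTBEAT_INTERVAL_S = 5   # worker refresh cadence
HEARTBEAT_TTL_S = 30       # EX value on the SET

def is_alive(last_heartbeat: float, now: float) -> bool:
    """Live while the key would still exist in Redis, i.e. within
    TTL seconds of the last SET ... EX 30."""
    return (now - last_heartbeat) < HEARTBEAT_TTL_S

def missed_heartbeats(last_heartbeat: float, now: float) -> int:
    """How many 5s refresh windows have elapsed without a SET."""
    return int((now - last_heartbeat) // HEARTBEAT_INTERVAL_S)
```

At expiry the worker has missed 30 / 5 = 6 consecutive refreshes, matching the "6 missed heartbeats" figure above.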
Connection Management¶
Each worker maintains broker connections lazily — connections are created on first use and cached for the session lifetime.
Connection Key¶
Connections are keyed on a 4-tuple: (broker_name, broker_type, simulation, account_id). A user with both a Shioaji simulation account and a Shioaji live account gets two separate broker sessions.
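A lazy cache keyed on that 4-tuple can be sketched as below. The `ConnectionCache` class name and `factory` callback are illustrative; the real worker's factory would authenticate against the broker SDK:

```python
from typing import Callable, Dict, Tuple

# (broker_name, broker_type, simulation, account_id)
ConnKey = Tuple[str, str, bool, str]

class ConnectionCache:
    """Lazy per-account connection cache (illustrative sketch)."""

    def __init__(self, factory: Callable[[ConnKey], object]):
        self._factory = factory
        self._conns: Dict[ConnKey, object] = {}

    def get(self, key: ConnKey) -> object:
        # Create on first use, reuse for the rest of the session.
        if key not in self._conns:
            self._conns[key] = self._factory(key)
        return self._conns[key]

    def invalidate(self, key: ConnKey) -> None:
        # On connection error: drop it so the next order reconnects.
        self._conns.pop(key, None)
```

Because `simulation` is part of the key, a user's Shioaji simulation and live accounts resolve to two distinct cached sessions, as described above.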
Connection Lifecycle¶
| Event | Action |
|---|---|
| First order for account | Create connection, authenticate with broker |
| Subsequent orders | Reuse cached connection |
| Connection error | Mark invalid, auto-reconnect on next order |
| Health check (60s) | Ping broker API, verify session alive |
| Idle timeout (30m) | Close all connections, shut down worker |
| Credential reload | Close affected connection, reconnect with new credentials |
Health Check¶
Every 60 seconds during idle periods, the worker pings each active broker connection:
- If healthy: no action
- If unhealthy: close connection, mark for reconnect on next order
- If broker API unreachable: log warning, retry on next check
This catches stale sessions before they cause order failures.
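The three outcomes map to actions mechanically, which a short sketch makes concrete. `run_health_check` and the action labels are illustrative names; `ping` stands in for whatever session-verification call the broker SDK offers:

```python
def run_health_check(connections: dict, ping) -> dict:
    """Sketch of the 60s idle-period health check.

    `ping(conn)` returns True (healthy), False (unhealthy),
    or raises (broker API unreachable).
    """
    actions = {}
    for key, conn in list(connections.items()):
        try:
            healthy = ping(conn)
        except Exception:
            actions[key] = "retry_next_check"        # log warning, try again in 60s
            continue
        if healthy:
            actions[key] = "ok"                      # no action
        else:
            connections.pop(key)                     # close the stale session
            actions[key] = "reconnect_on_next_order" # lazy reconnect
    return actions
```

Note the asymmetry: an unhealthy session is closed immediately, while an unreachable broker API leaves the connection in place so a transient outage does not force a needless re-login.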
Control Messages¶
Workers monitor a control message queue in addition to the order queue.
Currently supported control messages:
| Message | Action |
|---|---|
| `reload_credentials` | Close broker connections, re-fetch and decrypt credentials from database, reconnect. No worker restart needed. |
| `shutdown` | Graceful shutdown — close all connections, stop heartbeat, exit. |
This enables credential rotation without service interruption. When a user updates their broker password in the dashboard, the API pushes a reload_credentials message, and the worker picks it up within seconds.
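The worker's non-blocking control check can be sketched as a dispatch loop. Here a `deque` stands in for the Redis list (`popleft` playing the role of `LPOP`), and the `worker` object's method names are illustrative:

```python
from collections import deque

def check_control_messages(queue: deque, worker) -> None:
    """Drain pending control messages without blocking.

    The real worker does an LPOP on worker:control:{user_id}:messages
    between order requests; the deque here is a runnable stand-in.
    """
    while queue:
        msg = queue.popleft()  # LPOP equivalent: non-blocking pop
        if msg == "reload_credentials":
            worker.reload_credentials()  # reconnect in place, no restart
        elif msg == "shutdown":
            worker.shutdown()            # close connections, stop heartbeat
            break                        # no point draining after shutdown
```

Processing stops at `shutdown` so that messages queued behind it are never acted on by a worker that is supposed to be exiting.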
Orphan Detection & Recovery¶
The maintenance system runs continuously to detect and clean up inconsistencies between Redis state and ECS reality.
Anomaly Types¶
| Anomaly | Detection | Resolution | Cause |
|---|---|---|---|
| Orphan Mark | Redis key exists, no matching ECS task | Delete Redis key | Task crashed without cleanup |
| Orphan Task | ECS task running, no Redis key | Stop ECS task | Redis key expired (network partition) |
| Stale Mark | Redis key TTL expired but not deleted | Clean up associated keys | Worker froze (deadlock, OOM) |
| Duplicate Workers | Two tasks for same user_id | Stop older task | Race condition in claim flow |
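The first two anomaly types are pure set differences between Redis marks and ECS tasks. A minimal sketch under that simplification — stale marks and duplicates need TTL and per-user task data, so they are omitted, and `classify_anomalies` is an illustrative name:

```python
def classify_anomalies(redis_marks: set, ecs_tasks: set) -> dict:
    """Compare the set of user_ids marked in Redis against the set
    of user_ids with a running ECS task."""
    return {
        # Key exists, no task: task crashed without cleanup -> delete key.
        "orphan_marks": redis_marks - ecs_tasks,
        # Task runs, no key: key expired (network partition) -> stop task.
        "orphan_tasks": ecs_tasks - redis_marks,
    }
```

The symmetry is the point: either side of the Redis/ECS pair can be the stale one, and each direction has a distinct cleanup action.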
Fan-Out Architecture¶
The maintenance function uses a fan-out pattern for parallel processing:
- Coordinator (`maintenance`) scans all Redis marks and ECS tasks in one pass
- Partitions anomalies into batches of 100
- Invokes a `maintenance_worker` Lambda for each batch (async `InvokeAsync`)
- Each worker processes its batch independently
- Results published to CloudWatch custom metrics
This scales linearly: at 10,000 active workers, the coordinator invokes ~100 maintenance_worker Lambdas that process in parallel, completing the full scan in under 5 seconds.
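The partitioning step is plain list chunking. A minimal sketch (the function name `partition` is illustrative; the coordinator would pass each batch as the payload of one `maintenance_worker` invocation):

```python
def partition(items: list, batch_size: int = 100) -> list:
    """Split the anomaly list into batches of at most `batch_size`;
    the coordinator fans out one Lambda invocation per batch."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

At 10,000 items this yields exactly 100 batches, matching the fan-out figure above.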
Resource Allocation¶
| Resource | Value | Notes |
|---|---|---|
| CPU | 64 units | 3.125% of one vCPU. Sufficient — workers are I/O-bound (waiting on broker API responses). |
| Memory (soft limit) | 384 MB | ECS scheduling target. Normal working set is 150–250 MB. |
| Memory (hard limit) | 1024 MB | OOM kill threshold. Broker SDK memory leaks are caught here. |
| ECS task count per instance | 60 | Conservative cap (could theoretically fit 84 by soft limit). |
OOM Kill Behavior
When a container exceeds 1024 MB, the Linux kernel's OOM killer terminates the container process. The EC2 instance is unaffected — all other containers continue running. The ECS agent detects the stopped task, the maintenance Lambda detects the missing heartbeat within 60 seconds, and the orchestrator launches a replacement. The user experiences a brief interruption (< 2 minutes) but no data loss — all order state is in Redis and PostgreSQL.