Table of Contents

Security Architecture

Threat model
Five layers

Layer 1 -- Network
Layer 2 -- Transport
Layer 3 -- Application
Layer 4 -- Execution
Layer 5 -- OS

Layer interaction
What this architecture does NOT protect against
Certificate lifecycle
Design influences

Security Architecture

Storm Pulse is a read-write server management agent. It can restart services, rebuild containers, pull code, and run database migrations. A compromised agent is as dangerous as the operator account it runs as. On a hardened box that account is a rootless, sudo-less user confined to its own rootless-docker namespace (detailed below), not host root, though still the most privileged thing touching that server's services. The security architecture is designed around that reality.

Threat model

What we're protecting: VPS servers running production services on the public internet, managed remotely from a central Django dashboard.

Attackers we consider:

Network attacker -- can scan ports, intercept traffic, attempt man-in-the-middle attacks between agent and dashboard.
Compromised dashboard user -- has access to the web UI but shouldn't be able to execute arbitrary commands on servers.
Compromised agent -- one VPS is breached. The attacker should not be able to pivot to other agents or forge commands.
Replay attacker -- captures valid command messages and retransmits them later.

What a successful attack looks like: Arbitrary command execution on a managed VPS. That's the worst case, and every layer exists to prevent it.

Five layers

Security is not a single mechanism. It's a stack where each layer compensates for the failure of the layer above or below it. An attacker must defeat all five layers to achieve arbitrary execution.

Layer 1 -- Network

Principle: The agent never listens on a port. There is nothing to connect to.

The agent initiates all connections outbound to the dashboard over WebSocket. The VPS firewall blocks all inbound traffic except SSH from known IP addresses. Port scanners find nothing. There is no service to probe, no endpoint to DDoS, no listener to exploit.

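This is the phone-home model used by Cloudflare Tunnel, DigitalOcean's do-agent, and Portainer's Edge Agent. SaltStack shows what is at stake at the listening end of such a design: Salt minions also connect outbound, but the Salt master must accept connections on ports 4505 and 4506, and masters left reachable from the public internet were compromised in 2020 through CVE-2020-11651 and CVE-2020-11652. Storm Pulse keeps listeners off the managed servers entirely.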

The attack surface concentrates on the single Django dashboard endpoint, which is far easier to harden, monitor, and protect than 5-50 individual agent servers.

Layer 2 -- Transport

Principle: Both sides prove their identity cryptographically. Credentials never cross the wire.

All communication uses mutual TLS (mTLS). A private Certificate Authority (Smallstep step-ca) issues a unique client certificate per agent, with the agent ID embedded in the Subject Alternative Name.

How mTLS differs from API keys or JWTs: bearer tokens grant access to whoever holds them. If intercepted, they're immediately usable. mTLS proves identity during the TLS handshake: the certificate holder proves possession of the matching private key by signing data specific to that handshake. The private key itself is never transmitted, so an attacker who captures network traffic cannot extract reusable credentials.

Implementation:

Caddy terminates mTLS in front of Django, passing the verified client certificate DN and serial number as headers.
Each agent gets a unique certificate that can be individually revoked if that server is compromised. Revoking one agent's cert doesn't affect any other agent.
Certificates are short-lived (90 days) with automated renewal.
The private key never leaves the agent's filesystem.
The agent's TLS context loads the system CA bundle (for verifying public CAs like Let's Encrypt on the dashboard) and additionally loads the private CA certificate from config (tls.ca_cert). This allows agents to verify dashboard servers using either public or private certificates without requiring system-wide CA installation.
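
The trust setup described in the last item can be sketched with Python's standard `ssl` module. This is a minimal illustration, not the agent's actual code: the function name `build_agent_tls_context` and its parameter names are hypothetical, and only `tls.ca_cert` comes from the text above.

```python
import ssl
from typing import Optional


def build_agent_tls_context(
    ca_cert: Optional[str] = None,      # path from config (tls.ca_cert)
    client_cert: Optional[str] = None,  # hypothetical path to the agent certificate
    client_key: Optional[str] = None,   # hypothetical path to the agent private key
) -> ssl.SSLContext:
    """Build a client-side context for the outbound dashboard connection."""
    # create_default_context() loads the system CA bundle and enables
    # certificate and hostname verification of the server.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2

    # Additionally trust the private CA, without installing it system-wide.
    if ca_cert:
        ctx.load_verify_locations(cafile=ca_cert)

    # Present the per-agent client certificate during the handshake (mTLS).
    if client_cert and client_key:
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

A context built this way could be passed as the `ssl=` argument of `websockets.connect(...)`. Because the private CA is added on top of the system bundle, the same agent build can verify a dashboard that uses either kind of certificate.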

Layer 3 -- Application

Principle: Every command is signed. Replayed or forged commands are rejected.

On top of mTLS, every command the dashboard sends includes an HMAC-SHA256 signature computed over the command payload, a timestamp, and a unique nonce. The agent verifies three things before accepting a command:

HMAC valid -- the signature matches, proving the command originated from the dashboard and hasn't been tampered with.
Timestamp fresh -- the command was signed within the configured expiry window (default 60 seconds). Stale commands are rejected.
Nonce unique -- the nonce hasn't been seen before. Seen nonces are tracked in a SQLite database. This prevents replay attacks where a valid captured command is retransmitted.

This layer exists because mTLS alone doesn't prevent a compromised intermediary from injecting messages into an established TLS session (though this is extremely difficult). Defense in depth means assuming any single layer can fail.
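
The three checks in this layer can be sketched as follows. The message layout (timestamp, nonce, and sorted-key JSON joined by newlines), the table schema, and the returned reason strings are assumptions made for illustration; the source only specifies HMAC-SHA256 over payload, timestamp, and nonce, a 60-second default window, and a SQLite nonce store.

```python
import hashlib
import hmac
import json
import sqlite3
import time

EXPIRY_SECONDS = 60  # default expiry window named in the text


def sign(key: bytes, payload: dict, timestamp: int, nonce: str) -> str:
    # Hypothetical canonical form: both sides must serialize identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    message = f"{timestamp}\n{nonce}\n{body}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def open_nonce_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS nonces (nonce TEXT PRIMARY KEY, seen_at INTEGER)"
    )
    return db


def verify(db, key, payload, timestamp, nonce, signature, now=None) -> str:
    now = time.time() if now is None else now
    # 1. HMAC valid: constant-time comparison against the expected signature.
    expected = sign(key, payload, timestamp, nonce)
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return "bad_signature"
    # 2. Timestamp fresh: reject anything outside the expiry window.
    if abs(now - timestamp) > EXPIRY_SECONDS:
        return "expired"
    # 3. Nonce unique: the PRIMARY KEY constraint rejects a repeated nonce.
    try:
        with db:
            db.execute("INSERT INTO nonces VALUES (?, ?)", (nonce, int(now)))
    except sqlite3.IntegrityError:
        return "replayed"
    return "ok"
```

The signature is checked first so that unauthenticated messages can never write to the nonce table. A real store would also prune nonces older than the expiry window, since the timestamp check already rejects them.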

Layer 4 -- Execution

Principle: The agent can only do what the whitelist says. Nothing else.

Storm Pulse never executes arbitrary shell commands. A strict command registry defines every permitted action with its exact binary path, arguments, timeout, and whether it requires confirmation.

Protections:

No shell execution. Every command runs via subprocess.run() with shell=False. Shell metacharacters, pipes, redirects, and injection attempts are syntactically impossible.
Absolute binary paths. Commands specify /usr/bin/git, not git. PATH manipulation attacks don't apply.
Local parameter resolution. Placeholders like {project_dir} and {compose_file} are resolved from the agent's local config file, never from incoming messages. The dashboard cannot inject paths.
Unknown commands are rejected. If a command name isn't in the registry, it doesn't run. The agent doesn't try to interpret it.
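
The protections above can be sketched as a small registry. Everything here is illustrative: `CommandDef`, `REGISTRY`, `LOCAL_CONFIG`, the `/usr/bin/docker` path, the example directory, and the regexes are assumptions, not the agent's actual definitions. Only the two command templates and timeouts are taken from this document.

```python
import re
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandDef:
    argv: tuple        # argv template; element 0 is an absolute binary path
    timeout: int       # seconds
    wire_params: dict  # param name -> regex the wire value must fully match


REGISTRY = {
    "git_pull": CommandDef(("/usr/bin/git", "-C", "{project_dir}", "pull"), 60, {}),
    "docker_logs": CommandDef(
        ("/usr/bin/docker", "compose", "-f", "{compose_file}",
         "logs", "--tail", "{tail_lines}", "{service}"),
        30,
        {"tail_lines": r"[0-9]{1,5}", "service": r"[a-z0-9][a-z0-9_-]{0,62}"},
    ),
}

# Stands in for values read from the agent's local config file.
LOCAL_CONFIG = {"project_dir": "/srv/app", "compose_file": "/srv/app/compose.yml"}


def build_argv(name: str, wire: dict, local: dict = LOCAL_CONFIG) -> list:
    cmd = REGISTRY.get(name)
    if cmd is None:
        raise KeyError(f"unknown command: {name}")  # not in the whitelist
    values = dict(local)  # local placeholders never come from the wire
    for param, pattern in cmd.wire_params.items():
        value = str(wire.get(param, ""))
        if re.fullmatch(pattern, value) is None:
            raise ValueError(f"invalid value for {param}")
        values[param] = value
    # Each template element becomes exactly one argv element.
    return [part.format(**values) for part in cmd.argv]


def run(name: str, wire: dict) -> subprocess.CompletedProcess:
    argv = build_argv(name, wire)
    # shell=False: argv goes straight to exec, so metacharacters stay literal.
    return subprocess.run(argv, shell=False, timeout=REGISTRY[name].timeout,
                          capture_output=True, check=False)
```

Only parameters declared in `wire_params` are read from the incoming message, so a message that supplies its own `project_dir` is ignored rather than honored.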

The built-in whitelist covers two commands:

Command	Group	What it runs	Timeout
`git_pull`	deploy	`git -C {project_dir} pull`	60s
`docker_logs`	diagnostics	`docker compose -f {compose_file} logs --tail {tail_lines} {service}`	30s

Additional commands can be defined in the TOML config (e.g. docker_build, docker_up, django_migrate). Each custom command follows the same rules: absolute binary paths, shell=False, regex-validated parameters.

When the optional Garage S3 integration is enabled, a separate command set is registered for bucket and key management (e.g. garage_bucket_allow, garage_bucket_allow_rw, garage_bucket_allow_ro, garage_key_create). These execute inside Docker containers via docker exec with the same subprocess protections. Permission-granting commands are split by tier (_rw for read-write, _ro for read-only) rather than accepting a permissions parameter, keeping the template system free of conditional logic.

Log shipping uses the same subprocess discipline. The docker_raw source type calls docker logs with shell=False and an absolute binary path taken from the agent's local TOML config -- never from the wire. Container names come from the same local config; the dashboard cannot inject a container name into the command. Log lines are validated against strict format parsers before being shipped, so raw container output is never passed through unfiltered.

Internal commands. A growing set of commands execute inside the agent process rather than via subprocess. garage_refresh triggers a state collection. The Garage data-plane commands (garage_bucket_clear, garage_walk_bucket_stats) use a hand-rolled SigV4 S3 client (stdlib + cryptography only -- no boto3, no shell, no subprocess) to operate against the local Garage data-plane endpoint. buckets_custom_domain_caddy_sync talks to Caddy's admin HTTP API and writes atomically to a drop-in Caddyfile fragment. The shell=False discipline doesn't apply because there is no shell call at all. Layers 1-3 still apply unchanged: every internal command still requires HMAC + timestamp + nonce verification before dispatch, and must be registered in the whitelist to be dispatchable.
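
For readers unfamiliar with what a hand-rolled SigV4 client involves, the core of the scheme is a chained HMAC-SHA256 key derivation that needs nothing beyond the standard library. The sketch below shows only that step, following the published Signature Version 4 algorithm; the function names are hypothetical and it is not the agent's actual client. Building the canonical request and string-to-sign is omitted.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str,
                      service: str = "s3") -> bytes:
    """Derive the SigV4 signing key; `date` is YYYYMMDD in UTC."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sigv4_signature(signing_key: bytes, string_to_sign: str) -> str:
    """Hex signature placed in the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Because the derived key is scoped to a date, region, and service, the secret itself never has to be used directly when signing individual requests.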

Caddy admin API as a new outbound authority. When Caddy integration is enabled, the agent talks to Caddy's admin HTTP API at the configured admin_url (default http://localhost:2019). This is a new outbound authority, not a new inbound listener, so Layer 1 is unchanged. Caddy's admin API has no authentication by default and relies on being bound to localhost; the agent inherits that trust model. A non-localhost admin_url is the operator's responsibility to secure (firewall, mTLS proxy in front of :2019). The fragment params flow through the same HMAC + nonce + max-bytes pipeline as any other long-running command. The max_bytes validation primitive (added for the Caddyfile fragment) enforces a configurable byte cap on opaque-content params; ParamDef now rejects any declaration with neither a pattern nor a max_bytes set, preventing the "unvalidated param" footgun at config load time.
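
The load-time rule for `ParamDef` can be sketched as below. The class name and the `pattern` / `max_bytes` fields come from the paragraph above; the dataclass shape, the `name` field, and the `validate` method are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParamDef:
    name: str
    pattern: Optional[str] = None    # regex the whole value must match
    max_bytes: Optional[int] = None  # cap on the UTF-8 encoded size

    def __post_init__(self):
        # Reject "unvalidated" params when the config is loaded,
        # not when a command arrives.
        if self.pattern is None and self.max_bytes is None:
            raise ValueError(f"param {self.name!r} needs a pattern or max_bytes")
        if self.max_bytes is not None and self.max_bytes <= 0:
            raise ValueError(f"param {self.name!r} has a non-positive max_bytes")
        if self.pattern is not None:
            re.compile(self.pattern)  # surface a bad regex at load time too

    def validate(self, value: str) -> bool:
        if self.max_bytes is not None and len(value.encode("utf-8")) > self.max_bytes:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        return True
```

The cap is measured in encoded bytes rather than characters, so multi-byte content cannot exceed the limit by a factor of the encoding width.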

Long-running commands. Commands marked long_running in the registry produce zero-or-more command.progress events between the originating command.request and the terminal command.result. The protocol pattern is documented in Protocol Specification — Long-running commands. HMAC verification happens once, at dispatch -- subsequent progress events ride agent→dashboard on the already-authenticated WebSocket. Cancelled or agent-disconnected jobs emit no terminal result; the dashboard treats disconnect as failure.

The sign-off seal. Storm Pulse registers one command, run_verify_block, whose argv carries opaque shell text on the wire: ["/bin/bash", "-c", "{verify_command}"], 4 KiB cap, no regex on the payload. It exists because the Storm Developments dashboard's sign-off checklist needs per-row verify shell that can't be expressed as a baked template. That one command breaks the Layer 4 promise above: an HMAC-signed envelope can ask for arbitrary shell. The sign-off seal is the layer that bounds when that's allowed.

A freshly installed agent is sealed. stormpulse init writes a signoff.sealed marker into the agent state directory, and build_registry(..., signoff_sealed=True) excludes run_verify_block from the registry the agent advertises on register. If the dashboard dispatches it anyway, the agent re-stats the marker at dispatch time and returns command.result with success: false, failure_reason: "signoff_sealed". The register payload carries signoff_sealed: bool on every (re)connect so the dashboard mirrors the state in Server.signoff_sealed.
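
A minimal sketch of the two seal checks, assuming the marker is a plain file inside the state directory. The marker name, command name, and `failure_reason` value come from the text; the function names, the `type` key, the `unknown_command` reason, and the fail-closed handling of stat errors are assumptions.

```python
import os

SEAL_MARKER = "signoff.sealed"
SEALED_COMMAND = "run_verify_block"


def is_sealed(state_dir: str) -> bool:
    """Stat the marker on every call so a reseal takes effect immediately."""
    try:
        os.stat(os.path.join(state_dir, SEAL_MARKER))
        return True
    except FileNotFoundError:
        return False
    except OSError:
        # Assumption: if the marker cannot be inspected, stay sealed.
        return True


def advertised_commands(all_commands: list, state_dir: str) -> list:
    """Registry view sent on register: hide the command while sealed."""
    sealed = is_sealed(state_dir)
    return [c for c in all_commands if not (sealed and c == SEALED_COMMAND)]


def dispatch(name: str, state_dir: str, handlers: dict) -> dict:
    # Second check at dispatch time, independent of what was advertised.
    if name == SEALED_COMMAND and is_sealed(state_dir):
        return {"type": "command.result", "success": False,
                "failure_reason": "signoff_sealed"}
    handler = handlers.get(name)
    if handler is None:
        return {"type": "command.result", "success": False,
                "failure_reason": "unknown_command"}
    return handler()
```

Checking twice means a dashboard that ignores the advertised registry still gets a refusal, and a reseal on the host takes effect on the next dispatch without restarting the agent.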

Unsealing is host-only and loud:

stormpulse signoff unseal prints the consequences (RCE re-opens; persistence survives reseal; reseal is a kill switch, not a recovery) and refuses to proceed unless the operator types the host's hostname back at the prompt. Automation must pass --confirm-hostname HOSTNAME; the friction stays visible in source.
While unsealed, the agent emits a WARNING log every 5 minutes naming the unsealed duration, advertises the open hatch on every register, and tracks unsealed_since so "unsealed for X" stays accurate across agent restarts.
stormpulse signoff seal reseals in one keystroke. The safe direction needs no friction.

The dashboard has no path to seal or unseal. The agent has no whitelisted command that touches the seal file, so a compromised dashboard cannot reseal an agent it just unsealed, and a run_verify_block payload cannot toggle the flag from inside. The seal is a state, not a one-shot: re-verification of the same install repeats unseal → verify → reseal as many times as the install's lifetime needs. See ADR CORE-004 for the full design.

Layer 5 -- OS

Principle: Even if the agent process is compromised, the damage is contained.

The agent runs as the operator's admin user under a user systemd unit (systemctl --user). No system user, no root privileges, no sudo grants, no setuid binaries, no polkit hooks. The blast radius of a compromised agent process is whatever its launching user already had.

Systemd sandboxing. The user unit generated by stormpulse init enables the hardening directives that apply in user mode:

NoNewPrivileges=yes -- the process and its children cannot gain privileges through setuid or setgid binaries or file capabilities.
PrivateTmp=yes -- the agent gets its own /tmp, invisible to other processes.
ProtectKernelTunables=yes, ProtectKernelModules=yes, ProtectControlGroups=yes -- kernel surfaces are read-only.
ProtectSystem=strict plus explicit ReadWritePaths for the agent's state directory and the operator's project directory.

No privilege escalation surface. Container management runs against rootless dockerd -- there is no /var/run/docker.sock ownership to escalate through, no docker group, no root daemon. Git operations use the operator's own credentials. The agent never invokes sudo.

An attacker who gains code execution within the agent process inherits the operator user's privileges and nothing more -- confined to that user's home, that user's rootless docker namespace, and the sandboxing surface above.

Layer interaction

The layers are designed so that each one is independently useful, and failure of any single layer doesn't grant full access:

Scenario	What stops it
Attacker scans the VPS looking for the agent	Layer 1 -- no listening port exists
Attacker intercepts network traffic	Layer 2 -- mTLS, no reusable credentials in transit
Attacker somehow injects a message into the TLS session	Layer 3 -- HMAC verification fails
Attacker replays a captured valid command	Layer 3 -- nonce already seen, rejected
Attacker sends a valid-looking command for `rm -rf /`	Layer 4 -- not in whitelist, rejected
Attacker crafts a command with shell metacharacters	Layer 4 -- `shell=False`, metacharacters are literal strings
Attacker dispatches `run_verify_block` against a sealed agent	Layer 4 -- registry excludes the command when sealed; agent re-stats the seal at dispatch and returns `failure_reason: signoff_sealed`
Attacker compromises the agent process	Layer 5 -- sandboxed user, read-only filesystem, no escalation path
One agent's certificate is stolen	Layer 2 -- individual revocation, other agents unaffected

What this architecture does NOT protect against

Honesty about limitations is part of the security model:

Compromised dashboard. If an attacker gains access to the Django dashboard with valid credentials, they can send legitimately signed commands to any agent. mTLS and HMAC protect the channel, not the authorization decision. Per-command RBAC on the dashboard side mitigates this.
Compromised CA. If the Smallstep CA is breached, the attacker can issue valid certificates for any agent ID. The CA server must be hardened and access-restricted.
Supply chain attacks. A compromised dependency in the agent or dashboard could bypass all layers. The agent has three runtime dependencies (websockets, psutil, cryptography) to minimize this surface. New capabilities -- including the SigV4 S3 client added for garage_bucket_clear -- are written against stdlib + the existing dependencies rather than pulling in new vendor libraries.
Customer secrets passed through long-running command params. The Garage data-plane commands (garage_bucket_clear, garage_walk_bucket_stats) carry a customer-controlled S3 admin secret inside the signed command.request envelope. The agent constructs a client from the secret, uses it for the duration of the job, and drops the reference -- the secret is never persisted, never logged, and never appears in result payloads. It does live in agent process memory while the job runs. An attacker who has already achieved code execution inside the agent process during one of these jobs could read it. Layer 5 sandboxing limits the value of this access, and the secret only authorizes operations on the customer's own bucket. New long-running commands that need the same pattern should follow the same approach (per-job client, sensitive_output=true, no standing credentials) rather than introducing agent-held S3 keys.
The unsealed window. While an operator has run stormpulse signoff unseal, run_verify_block accepts opaque shell from any HMAC-signed envelope. A compromised dashboard during that window achieves arbitrary execution on the host. Reseal is a kill switch for new shell after the marker flips back, not a recovery: anything that ran during the unsealed window already ran, and persistence implanted then survives the reseal. Mitigations: keep the window short (the agent nags every 5 minutes; the dashboard banners and pages), and treat reseal as closing the hatch, not undoing the unseal.
Root or physical access. An attacker with root on the VPS (or physical access to the machine) can read the agent's certificates and HMAC key. At that point they already have more access than the agent provides.
Dashboard availability. If the dashboard goes down, agents cannot receive commands. They keep buffering metrics in a local SQLite database and reconnect with exponential backoff, but management capability is lost until the dashboard recovers.
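
The reconnect behaviour mentioned under dashboard availability can be sketched as a delay generator. The base delay, cap, growth factor, and jitter values are assumptions; the source only states that agents reconnect with exponential backoff.

```python
import random


def backoff_delays(base=1.0, cap=300.0, factor=2.0, jitter=0.1, rng=random.random):
    """Yield reconnect delays in seconds: exponential growth up to `cap`.

    `jitter` spreads each delay by +/- that fraction so that many agents
    do not reconnect in lockstep after a dashboard outage.
    """
    delay = base
    while True:
        spread = delay * jitter
        yield max(0.0, delay - spread + 2 * spread * rng())
        delay = min(cap, delay * factor)
```

A reconnect loop would take the next delay after each failed attempt and create a fresh generator once a connection succeeds, so the first retry after a healthy session is fast again.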

Certificate lifecycle

Enrollment. Operator installs the agent and runs stormpulse enroll <endpoint> <agent_id> <token>. The agent generates an EC P-256 keypair locally — the private key never leaves the machine. It builds a CSR with CN=<agent_id>, signed with the private key to prove possession. The CSR and one-time enrollment token are POSTed to the dashboard over standard HTTPS (no client cert yet — that's what enrollment provisions). The dashboard validates the token, signs the CSR with the private CA, and returns the signed certificate, CA certificate, and HMAC shared secret. The agent writes these to disk with strict permissions: private key and HMAC key at 0600 (owner-only), certificates at 0644. The enrollment token is burned after use and cannot be reused.
Normal operation. All connections use mTLS. The agent presents its client certificate on every WebSocket connection.
Renewal. Certificates are valid for 90 days. Automated renewal via step-ca before expiry.
Revocation. If an agent is compromised, its certificate is revoked in the CA. The agent can no longer connect. No other agent is affected.
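
The file-permission step at the end of enrollment can be sketched with the standard library alone. The file names and the `store_enrollment` helper are hypothetical; the modes (0600 for the private key and HMAC key, 0644 for certificates) are the ones stated above. `os.fchmod` is Unix-only, which matches a Linux VPS.

```python
import os
import stat


def write_with_mode(path: str, data: bytes, mode: int) -> None:
    """Write a file so that it never exists with looser permissions."""
    # Create with the target mode up front instead of chmod-ing afterwards.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    with os.fdopen(fd, "wb") as fh:
        # Enforce the exact mode even if the umask or an existing file differs.
        os.fchmod(fh.fileno(), mode)
        fh.write(data)


def store_enrollment(directory: str, key_pem: bytes, hmac_key: bytes,
                     cert_pem: bytes, ca_pem: bytes) -> None:
    # Hypothetical file names; secrets are owner-only, certificates are public.
    write_with_mode(os.path.join(directory, "agent.key"), key_pem, 0o600)
    write_with_mode(os.path.join(directory, "hmac.key"), hmac_key, 0o600)
    write_with_mode(os.path.join(directory, "agent.crt"), cert_pem, 0o644)
    write_with_mode(os.path.join(directory, "ca.crt"), ca_pem, 0o644)
```

Passing the mode to `os.open` avoids a window in which the private key is readable by other users between creation and a later `chmod`.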

Design influences

This architecture follows patterns validated in production by:

Cloudflare Tunnel -- outbound-only connections, zero inbound ports, tunnel credentials over TLS.
Portainer Edge Agent -- outbound polling with mTLS and rotating credentials.
Teleport -- reverse tunnels with short-lived certificates from an internal CA.
Netdata ACLK -- outbound WebSocket with public/private key pairs.

The common thread: each of these agents connects outbound only and authenticates with keys or certificates rather than exposing a listener on the managed host. Storm Pulse follows the same pattern.