3 Security Architecture
mathew edited this page 2026-02-21 18:49:25 +00:00

Security Architecture

Storm Pulse is a read-write server management agent. It can restart services, rebuild containers, pull code, and run database migrations. A compromised agent is effectively root-equivalent on the target server. The security architecture is designed around that reality.


Threat model

What we're protecting: VPS servers running production services on the public internet, managed remotely from a central Django dashboard.

Attackers we consider:

  • Network attacker -- can scan ports, intercept traffic, attempt man-in-the-middle attacks between agent and dashboard.
  • Compromised dashboard user -- has access to the web UI but shouldn't be able to execute arbitrary commands on servers.
  • Compromised agent -- one VPS is breached. The attacker should not be able to pivot to other agents or forge commands.
  • Replay attacker -- captures valid command messages and retransmits them later.

What a successful attack looks like: Arbitrary command execution on a managed VPS. That's the worst case, and every layer exists to prevent it.


Five layers

Security is not a single mechanism. It's a stack where each layer compensates for the failure of the layer above or below it. An attacker must defeat all five layers to achieve arbitrary execution.

Layer 1 -- Network

Principle: The agent never listens on a port. There is nothing to connect to.

The agent initiates all connections outbound to the dashboard over WebSocket. The VPS firewall blocks all inbound traffic except SSH from known IP addresses. Port scanners find nothing. There is no service to probe, no endpoint to DDoS, no listener to exploit.

This is the phone-home model used by Cloudflare Tunnel, DigitalOcean's do-agent, and Portainer's Edge Agent. The 2020 SaltStack breaches (CVE-2020-11651 and CVE-2020-11652) show what an exposed management listener costs: attackers reached Salt masters through ports 4505 and 4506 left open to the internet. Storm Pulse keeps listeners off the managed servers entirely.

The attack surface concentrates on the single Django dashboard endpoint, which is far easier to harden, monitor, and protect than 5-50 individual agent servers.

Layer 2 -- Transport

Principle: Both sides prove their identity cryptographically. Credentials never cross the wire.

All communication uses mutual TLS (mTLS). A private Certificate Authority (Smallstep step-ca) issues a unique client certificate per agent, with the agent ID embedded in the Subject Alternative Name.

How mTLS differs from API keys or JWTs: bearer tokens grant access to whoever holds them. If intercepted, they're immediately usable. With mTLS, each side proves its identity during the TLS handshake by signing handshake data with its private key, and the key itself is never transmitted. An attacker who captures network traffic cannot extract reusable credentials.

Implementation:

  • Nginx terminates mTLS in front of Django, passing the verified client certificate DN and serial number as headers.
  • Each agent gets a unique certificate that can be individually revoked if that server is compromised. Revoking one agent's cert doesn't affect any other agent.
  • Certificates are short-lived (90 days) with automated renewal.
  • The private key never leaves the agent's filesystem.
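
On the Django side, the headers that Nginx forwards can be turned into an agent identity. The following is a minimal sketch, not the project's actual code. It assumes the DN arrives in comma-separated form with the agent ID in the CN (as in the enrollment CSR), and that the dashboard keeps a set of revoked serial numbers in upper-case hex. The function names and sample values are hypothetical.

```python
def parse_dn(dn: str) -> dict[str, str]:
    """Split a DN such as 'CN=agent-7,O=Storm Pulse' into a dict.

    Deliberately naive: escaped commas inside values are not handled.
    """
    fields = {}
    for part in dn.split(","):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key.upper()] = value
    return fields


def identify_agent(dn: str, serial: str, revoked_serials: set[str]) -> str:
    """Return the agent ID for a certificate Nginx has already verified.

    `revoked_serials` is an assumed dashboard-side set of upper-case hex serials.
    """
    if serial.upper() in revoked_serials:
        raise PermissionError("certificate has been revoked")
    agent_id = parse_dn(dn).get("CN")
    if not agent_id:
        raise PermissionError("certificate DN carries no CN")
    return agent_id
```

A view or WebSocket consumer would call identify_agent() with the two header values before accepting the connection. This is only safe if Nginx overwrites those headers on every request so a client cannot supply them directly.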

Layer 3 -- Application

Principle: Every command is signed. Replayed or forged commands are rejected.

On top of mTLS, every command the dashboard sends includes an HMAC-SHA256 signature computed over the command payload, a timestamp, and a unique nonce. The agent verifies three things before accepting a command:

  1. HMAC valid -- the signature matches, proving the command originated from the dashboard and hasn't been tampered with.
  2. Timestamp fresh -- the command was signed within the configured expiry window (default 60 seconds). Stale commands are rejected.
  3. Nonce unique -- the nonce hasn't been seen before. Seen nonces are tracked in a SQLite database. This prevents replay attacks where a valid captured command is retransmitted.
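
The three checks above can be sketched with the standard library alone. This is an illustration under assumptions, not the agent's real wire format: the canonical JSON encoding, the table name seen_nonces, and the function names are all hypothetical.

```python
import hashlib
import hmac
import json
import sqlite3
import time


def sign_command(secret: bytes, payload: dict, timestamp: int, nonce: str) -> str:
    """Return a hex HMAC-SHA256 over payload, timestamp, and nonce."""
    message = json.dumps(
        {"payload": payload, "timestamp": timestamp, "nonce": nonce},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def open_nonce_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the SQLite table that remembers seen nonces."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen_nonces "
        "(nonce TEXT PRIMARY KEY, seen_at INTEGER)"
    )
    return db


def verify_command(secret, db, payload, timestamp, nonce, signature,
                   max_age=60, now=None) -> bool:
    """Apply the HMAC, freshness, and nonce checks in that order."""
    now = int(time.time()) if now is None else now
    expected = sign_command(secret, payload, timestamp, nonce)
    # 1. HMAC valid: constant-time comparison of the two signatures.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False
    # 2. Timestamp fresh: reject anything outside the expiry window.
    if abs(now - timestamp) > max_age:
        return False
    # 3. Nonce unique: the primary key makes a second insert fail.
    try:
        with db:
            db.execute(
                "INSERT INTO seen_nonces (nonce, seen_at) VALUES (?, ?)",
                (nonce, now),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

Checking the HMAC first means unauthenticated messages never reach the nonce table, so an attacker without the key cannot fill it with junk.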

This layer exists because mTLS alone doesn't prevent a compromised intermediary from injecting messages into an established TLS session (though this is extremely difficult). Defense in depth means assuming any single layer can fail.

Layer 4 -- Execution

Principle: The agent can only do what the whitelist says. Nothing else.

Storm Pulse never executes arbitrary shell commands. A strict command registry defines every permitted action with its exact binary path, arguments, timeout, and whether it requires confirmation.

Protections:

  • No shell execution. Every command runs via subprocess.run() with shell=False, so each argument reaches the binary as a literal string. Shell metacharacters, pipes, and redirects are never interpreted.
  • Absolute binary paths. Commands specify /usr/bin/git, not git. PATH manipulation attacks don't apply.
  • Local parameter resolution. Placeholders like {project_dir} and {compose_file} are resolved from the agent's local config file, never from incoming messages. The dashboard cannot inject paths.
  • Unknown commands are rejected. If a command name isn't in the registry, it doesn't run. The agent doesn't try to interpret it.

The current whitelist covers five commands in the deploy group:

| Command | What it runs | Timeout |
|---|---|---|
| git_pull | git -C {project_dir} pull | 60s |
| docker_build | docker compose -f {compose_file} build | 300s |
| docker_down | docker compose -f {compose_file} down | 60s |
| docker_up | docker compose -f {compose_file} up -d | 120s |
| django_migrate | docker compose -f {compose_file} exec {service} python manage.py migrate | 120s |
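
A registry like this might be expressed as a plain dictionary of argument templates. The sketch below is illustrative only. The /usr/bin/docker path, the config values, and the names REGISTRY, LOCAL_CONFIG, build_argv, and run_command are assumptions; only /usr/bin/git, the commands, and the timeouts come from the text above.

```python
import subprocess

# Stand-in for the agent's local config file; values are hypothetical.
LOCAL_CONFIG = {
    "project_dir": "/opt/app",
    "compose_file": "/opt/app/docker-compose.yml",
    "service": "web",
}

DOCKER = "/usr/bin/docker"  # assumed absolute path

# Command name -> (argv template, timeout in seconds).
REGISTRY = {
    "git_pull": (["/usr/bin/git", "-C", "{project_dir}", "pull"], 60),
    "docker_build": ([DOCKER, "compose", "-f", "{compose_file}", "build"], 300),
    "docker_down": ([DOCKER, "compose", "-f", "{compose_file}", "down"], 60),
    "docker_up": ([DOCKER, "compose", "-f", "{compose_file}", "up", "-d"], 120),
    "django_migrate": (
        [DOCKER, "compose", "-f", "{compose_file}", "exec", "{service}",
         "python", "manage.py", "migrate"],
        120,
    ),
}


def build_argv(name: str, config: dict = LOCAL_CONFIG):
    """Resolve a registered command into (argv, timeout) using local config only."""
    if name not in REGISTRY:
        raise ValueError(f"unknown command: {name!r}")
    template, timeout = REGISTRY[name]
    # Placeholders are filled from local config, never from the incoming message.
    argv = [part.format(**config) for part in template]
    return argv, timeout


def run_command(name: str) -> subprocess.CompletedProcess:
    """Run a registered command without a shell and with its own timeout."""
    argv, timeout = build_argv(name)
    return subprocess.run(
        argv, shell=False, timeout=timeout,
        capture_output=True, text=True, check=False,
    )
```

The incoming message contributes only the command name. A name such as "rm -rf /" fails the registry lookup before any process is spawned.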

Layer 5 -- OS

Principle: Even if the agent process is compromised, the damage is contained.

The agent runs as a dedicated stormpulse user with no root access. Privilege escalation is restricted at multiple levels:

Systemd sandboxing:

  • ProtectSystem=strict -- the entire filesystem is read-only except explicitly allowed paths.
  • NoNewPrivileges=yes -- the process cannot gain new privileges via setuid, setgid, or capabilities.
  • PrivateTmp=yes -- the agent gets its own /tmp, invisible to other processes.
  • ProtectHome=yes -- home directories are inaccessible.
  • ProtectKernelTunables=yes -- /proc/sys and similar are read-only.
  • ProtectKernelModules=yes -- loading kernel modules is blocked.
  • ProtectControlGroups=yes -- cgroup modifications are blocked.
  • ReadOnlyPaths=/ -- the root filesystem is explicitly read-only.
  • ReadWritePaths=/opt/stormpulse/data -- only the data directory is writable.

Targeted sudo via polkit: The stormpulse user has sudo permissions for only the specific binaries the command registry needs (docker, git). Polkit policies restrict which systemd units and docker operations are permitted.

An attacker who gains code execution within the agent process is trapped in a sandboxed, unprivileged user account with read-only filesystem access and no path to escalation.


Layer interaction

The layers are designed so that each one is independently useful, and failure of any single layer doesn't grant full access:

| Scenario | What stops it |
|---|---|
| Attacker scans the VPS looking for the agent | Layer 1 -- no listening port exists |
| Attacker intercepts network traffic | Layer 2 -- mTLS, no reusable credentials in transit |
| Attacker somehow injects a message into the TLS session | Layer 3 -- HMAC verification fails |
| Attacker replays a captured valid command | Layer 3 -- nonce already seen, rejected |
| Attacker sends a valid-looking command for rm -rf / | Layer 4 -- not in whitelist, rejected |
| Attacker crafts a command with shell metacharacters | Layer 4 -- shell=False, metacharacters are literal strings |
| Attacker compromises the agent process | Layer 5 -- sandboxed user, read-only filesystem, no escalation path |
| One agent's certificate is stolen | Layer 2 -- individual revocation, other agents unaffected |

What this architecture does NOT protect against

Honesty about limitations is part of the security model:

  • Compromised dashboard. If an attacker gains access to the Django dashboard with valid credentials, they can send legitimately signed commands to any agent. mTLS and HMAC protect the channel, not the authorization decision. Per-command RBAC on the dashboard side mitigates this.
  • Compromised CA. If the Smallstep CA is breached, the attacker can issue valid certificates for any agent ID. The CA server must be hardened and access-restricted.
  • Supply chain attacks. A compromised dependency in the agent or dashboard could bypass all layers. The agent has two runtime dependencies (websockets, psutil) to minimize this surface.
  • Root or physical access. An attacker with root on the VPS, or physical access to it, can read the agent's certificates and HMAC key. At that point they already have more access than the agent provides.
  • Dashboard availability. If the dashboard goes down, agents cannot receive commands. They keep buffering metrics in a local SQLite database and reconnect with exponential backoff, but management capability is lost until the dashboard recovers.
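
The exponential backoff mentioned under dashboard availability could be generated as follows. This is a sketch with assumed parameters: the 1-second base, 60-second cap, and optional jitter are illustrative, not Storm Pulse's configured values.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0, rng=random.random):
    """Yield an endless sequence of reconnect delays in seconds.

    Each delay is `factor` times the previous one, capped at `cap`.
    `jitter` adds up to that fraction on top, so agents that lost the
    dashboard at the same moment do not all reconnect at once.
    """
    delay = base
    while True:
        yield delay * (1.0 + jitter * rng())
        delay = min(cap, delay * factor)
```

A reconnect loop would take the next delay after each failed attempt and create a fresh generator once a connection succeeds, so the next outage starts again from the base delay.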

Certificate lifecycle

  1. Enrollment. Operator installs the agent and runs stormpulse enroll <endpoint> <agent_id> <token>. The agent generates an EC P-256 keypair locally; the private key never leaves the machine. It builds a CSR with CN=<agent_id>, signed with the private key to prove possession. The CSR and one-time enrollment token are POSTed to the dashboard over standard HTTPS (no client cert yet, since that is what enrollment provisions). The dashboard validates the token, signs the CSR with the private CA, and returns the signed certificate, CA certificate, and HMAC shared secret. The agent writes these to disk with strict permissions: private key and HMAC key at 0600 (owner-only), certificates at 0644. The enrollment token is burned after use and cannot be reused.
  2. Normal operation. All connections use mTLS. The agent presents its client certificate on every WebSocket connection.
  3. Renewal. Certificates are valid for 90 days and are renewed automatically through step-ca before they expire.
  4. Revocation. If an agent is compromised, its certificate is revoked in the CA. The agent can no longer connect. No other agent is affected.
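
The permission handling at the end of the enrollment step can be sketched like this. It is a minimal example, assuming a Linux host. The file names agent.key, hmac.key, agent.crt, and ca.crt are hypothetical, as are the function names.

```python
import os


def write_with_mode(path: str, data: bytes, mode: int) -> None:
    """Create `path` with exactly `mode`, refusing to overwrite an existing file."""
    # Create owner-only first so the file is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        # Set the final mode explicitly so the process umask cannot change it.
        os.fchmod(handle.fileno(), mode)
        handle.write(data)


def store_credentials(directory: str, private_key: bytes, hmac_key: bytes,
                      certificate: bytes, ca_certificate: bytes) -> None:
    """Write enrollment results: secrets at 0600, certificates at 0644."""
    write_with_mode(os.path.join(directory, "agent.key"), private_key, 0o600)
    write_with_mode(os.path.join(directory, "hmac.key"), hmac_key, 0o600)
    write_with_mode(os.path.join(directory, "agent.crt"), certificate, 0o644)
    write_with_mode(os.path.join(directory, "ca.crt"), ca_certificate, 0o644)
```

Using O_EXCL means a second enrollment attempt fails loudly instead of silently replacing an existing key.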

Design influences

This architecture follows patterns validated in production by:

  • Cloudflare Tunnel -- outbound-only connections, zero inbound ports, tunnel credentials over TLS.
  • Portainer Edge Agent -- outbound polling with mTLS and rotating credentials.
  • Teleport -- reverse tunnels with short-lived certificates from an internal CA.
  • Netdata ACLK -- outbound WebSocket with public/private key pairs.

The common thread: each of these agents connects outbound only and authenticates with certificates or key pairs rather than a listening port and a shared password. Storm Pulse follows this consensus.