Container Escape Prevention
You deploy containers without hardening. One privileged container is compromised and the attacker has root on the host. From there they move laterally, and your cluster is compromised. Here's how to prevent it.
What is a container escape? Simply explained.
A container escape is when an attacker breaks out of container isolation and gains access to the host OS or other containers. Common vectors: privileged containers (--privileged), Docker socket mounts (/var/run/docker.sock), missing seccomp/AppArmor profiles, and CVEs in the kernel or container runtime (runc, containerd). Most escapes exploit misconfiguration rather than a 0-day. Good container security rests on Pod Security Standards (restricted), seccomp/AppArmor profiles, a read-only root filesystem, and a gVisor sandbox for untrusted workloads.
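The vectors listed above can be checked mechanically in a pod spec. The following is a minimal sketch, assuming the spec has already been loaded into a Python dict (for example from `kubectl get pod -o json`); the function name `find_escape_risks` and the wording of the findings are illustrative, not part of any tool mentioned here.

```python
from typing import Any


def find_escape_risks(pod_spec: dict[str, Any]) -> list[str]:
    """Return human-readable findings for escape-prone settings in a pod spec."""
    findings = []
    pod_sc = pod_spec.get("securityContext") or {}

    # Host namespace sharing collapses isolation at the pod level.
    for field in ("hostPID", "hostIPC", "hostNetwork"):
        if pod_spec.get(field):
            findings.append(f"{field} is enabled")

    # hostPath volumes expose the host filesystem; docker.sock is the worst case.
    for volume in pod_spec.get("volumes") or []:
        host_path = (volume.get("hostPath") or {}).get("path")
        if host_path is not None:
            findings.append(f"hostPath volume mounts {host_path}")

    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}
        if sc.get("privileged"):
            findings.append(f"{name}: privileged container")
        # A container-level profile overrides the pod-level one; flag when neither is set.
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            findings.append(f"{name}: no seccomp profile")
        if not sc.get("readOnlyRootFilesystem"):
            findings.append(f"{name}: writable root filesystem")
    return findings


# Hypothetical pod spec combining several of the misconfigurations.
risky = {
    "hostPID": True,
    "volumes": [{"name": "sock", "hostPath": {"path": "/var/run/docker.sock"}}],
    "containers": [{"name": "app", "securityContext": {"privileged": True}}],
}
for finding in find_escape_risks(risky):
    print(finding)
```

A check like this only covers what is visible in the manifest; it says nothing about kernel or runtime CVEs, which need patching rather than configuration.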
5 Container Escape Vectors & Fixes
Running a container with --privileged grants it every capability and access to host devices. Escape is then trivial via /proc/sysrq-trigger, device mounts, or kernel module loading.
# WRONG — never run in production
docker run --privileged myimage
# CORRECT — drop ALL capabilities, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Kubernetes: enforce via Pod Security Standards
apiVersion: v1
kind: Pod
metadata:
  name: app # placeholder name
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myimage # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"] # only if needed
Mounting host paths such as /, /etc, or /var/run/docker.sock gives the container full host access. A Docker socket mount is equivalent to root on the host.
# WRONG — mounts giving host escape
docker run -v /:/host myimage # Full host filesystem
docker run -v /etc:/etc myimage # Host config
docker run -v /var/run/docker.sock:/var/run/docker.sock myimage # Docker-in-Docker escape
# CORRECT — mount only what's needed, read-only where possible
docker run -v /data/app:/app:ro myimage
# Kubernetes: OPA Gatekeeper policy blocking docker.sock mounts
# (requires the K8sPSPVolumeTypes ConstraintTemplate from the Gatekeeper library)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPVolumeTypes
metadata:
  name: psp-volume-types
spec:
  match:
    kinds: [{apiGroups: [""], kinds: ["Pod"]}]
  parameters:
    volumes:
    - "configMap"
    - "emptyDir"
    - "projected"
    - "secret"
    - "downwardAPI"
    - "persistentVolumeClaim"
    # hostPath EXCLUDED: no host mounts
Without seccomp, a container can make any kernel syscall. More than 300 are available, and many enable privilege escalation (ptrace, mount, keyctl, clone with new namespaces).
# Apply seccomp RuntimeDefault (Docker's default profile blocks around 44 of 300+ syscalls)
# Docker: the default profile applies automatically; pass a custom profile with:
docker run --security-opt seccomp=/path/to/profile.json myimage
# Kubernetes: apply to all containers in the pod via RuntimeDefault
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault # Blocks dangerous syscalls automatically
# AppArmor (Ubuntu/Debian):
docker run --security-opt apparmor=docker-default myimage
# OpenClaw: detect containers running without seccomp profile
openclaw check --namespace production --rule no-seccomp-profile
A writable container filesystem lets attackers modify binaries, add persistence, and install tools after gaining initial access.
# Kubernetes: enforce read-only root filesystem
spec:
  containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp # tmpfs for writable temp
    - name: var-run
      mountPath: /var/run # tmpfs for runtime files
  volumes:
  - name: tmp
    emptyDir:
      medium: Memory
  - name: var-run
    emptyDir:
      medium: Memory
# Docker:
docker run --read-only --tmpfs /tmp --tmpfs /var/run myimage
Sharing host PID/IPC/network namespaces with the container collapses isolation boundaries. hostPID=true allows the container to see and signal all host processes.
# Kubernetes: explicitly prohibit host namespace sharing
spec:
  hostPID: false # Default false, but set explicitly
  hostIPC: false # Default false, but set explicitly
  hostNetwork: false # Default false; only enable if actually needed
# OPA Gatekeeper constraint to block hostPID/hostIPC:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPHostNamespace
metadata:
  name: psp-host-namespace
spec:
  match:
    kinds: [{apiGroups: [""], kinds: ["Pod"]}]
# Constraint: hostPID and hostIPC must be false
Real-World Scars: Production Incidents
A privileged container was deployed; the attacker gained root on the host and attacked all other containers. Fix: enforce Pod Security Standards 'restricted' and allow no privileged containers.
The Docker socket was mounted in a container; the attacker took control of the Docker daemon and spawned malicious containers. Fix: block hostPath mounts and restrict volume types with OPA Gatekeeper.
Immediate Actions: What to do today?
Enable Pod Security Standards
Enforce 'restricted' profile at cluster level.
Enforce seccomp RuntimeDefault
seccompProfile.type: RuntimeDefault for all pods.
Read-only root filesystem
readOnlyRootFilesystem: true for all containers.
gVisor for untrusted workloads
RuntimeClass gvisor for multi-tenant.
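The per-pod actions above (seccomp RuntimeDefault, read-only root filesystem, optional gVisor) can be applied programmatically before a manifest is submitted. This is a rough sketch, assuming pod specs are handled as plain Python dicts; `harden_pod_spec` is a hypothetical helper, and `runtime_class="gvisor"` only works if a RuntimeClass with that name already exists in the cluster.

```python
import copy
import json
from typing import Any, Optional


def harden_pod_spec(pod_spec: dict[str, Any],
                    runtime_class: Optional[str] = None) -> dict[str, Any]:
    """Return a copy of pod_spec with the hardening defaults applied."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's dict untouched

    # Pod-level: seccomp RuntimeDefault for every container in the pod.
    pod_sc = spec.setdefault("securityContext", {})
    pod_sc["seccompProfile"] = {"type": "RuntimeDefault"}

    for container in spec.get("containers", []):
        sc = container.setdefault("securityContext", {})
        sc["privileged"] = False
        sc["allowPrivilegeEscalation"] = False
        sc["readOnlyRootFilesystem"] = True
        # Drop everything; an existing "add" list is kept for manual review.
        sc.setdefault("capabilities", {})["drop"] = ["ALL"]

    # Optional sandbox for untrusted workloads.
    if runtime_class is not None:
        spec["runtimeClassName"] = runtime_class
    return spec


# Hypothetical minimal pod spec.
original = {"containers": [{"name": "app", "image": "myimage"}]}
print(json.dumps(harden_pod_spec(original, runtime_class="gvisor"), indent=2))
```

Workloads that write to their root filesystem will fail with readOnlyRootFilesystem set, so writable paths such as /tmp still need emptyDir mounts added by hand.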
Frequently Asked Questions
What is a container escape and how common are they?
A container escape is when an attacker who has compromised a container process breaks out of container isolation and gains access to the host OS or other containers. Common vectors: privileged containers (trivially exploitable), docker.sock mounts (instant root on host), CVEs in the kernel or in runtimes such as runc and containerd (for example CVE-2024-21626, a runc escape), and dangerous syscalls left open without seccomp (ptrace, mount). Real-world frequency: escape-enabling misconfigurations turn up in nearly every Kubernetes security audit. Misconfigurations (privileged containers, socket mounts) are far more common than actual kernel exploits, and easier to fix.
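One way to see why a privileged container is trivially exploitable is to decode its effective capability mask, which Linux exposes as the CapEff line in /proc/<pid>/status. The sketch below is illustrative: the bit positions come from linux/capability.h, but the choice of which capabilities to call dangerous is an assumption made here, and `dangerous_caps` and `read_cap_eff` are hypothetical helper names.

```python
# Bit positions from linux/capability.h; this "dangerous" subset is an assumption.
DANGEROUS_CAPS = {
    2: "CAP_DAC_READ_SEARCH",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    17: "CAP_SYS_RAWIO",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def dangerous_caps(cap_eff_hex: str) -> list[str]:
    """Decode a CapEff hex mask into the risky capability names it contains."""
    mask = int(cap_eff_hex, 16)
    return [name for bit, name in sorted(DANGEROUS_CAPS.items()) if mask & (1 << bit)]


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Read the CapEff field for the current process (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise ValueError(f"CapEff not found in {status_path}")


# Docker's default capability set contains none of the capabilities above.
print(dangerous_caps("00000000a80425fb"))  # → []
# A mask with all 41 capability bits set, as a privileged container
# typically has on a recent kernel, contains all of them.
print(dangerous_caps("000001ffffffffff"))
```

Inside a running container, `dangerous_caps(read_cap_eff())` gives a quick indication of whether the process holds capabilities that make escape easy; the exact full mask varies with kernel version.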
What is gVisor and when should I use it?
gVisor is a user-space kernel written in Go (developed by Google) that intercepts container syscalls before they reach the host kernel. Instead of syscalls going directly to the Linux kernel, gVisor's Sentry handles them. Why this matters: even if a container process exploits a kernel vulnerability, it exploits gVisor's kernel — not the host kernel. A gVisor escape would require a second exploit against gVisor itself. Use gVisor when: running untrusted workloads (e.g., user-submitted code), multi-tenant Kubernetes where tenant isolation is critical, any workload that would otherwise require privileged containers. Tradeoff: ~10-20% performance overhead for syscall-heavy workloads. Kubernetes RuntimeClass: set runtimeClassName: gvisor on sensitive pods.
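To make the RuntimeClass wiring concrete, here is a small sketch that generates the two manifests involved: the RuntimeClass itself and a pod that opts into it. It assumes the nodes' container runtime is already configured with a gVisor handler named `runsc` (the conventional name; adjust if your setup differs). The pod name and image are placeholders.

```python
import json


def gvisor_runtime_class(name: str = "gvisor", handler: str = "runsc") -> dict:
    """RuntimeClass mapping a cluster-visible name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,  # assumption: nodes have a handler with this name configured
    }


def sandboxed_pod(pod_name: str, image: str, runtime_class: str = "gvisor") -> dict:
    """Pod that runs under the given RuntimeClass instead of the default runtime."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class,
            "containers": [{"name": "app", "image": image}],
        },
    }


# Placeholder pod name and image; the output can be piped to `kubectl apply -f -`.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [gvisor_runtime_class(), sandboxed_pod("untrusted-job", "myimage")],
}
print(json.dumps(manifests, indent=2))
```

Only pods that set runtimeClassName are sandboxed; everything else keeps using the default runtime, which is what lets the overhead be limited to untrusted workloads.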
How does OpenClaw detect container escape attempts at runtime?
OpenClaw integrates with Falco for runtime detection. Key rules: 1) Privileged container process spawning shell (container_shell_from_privileged). 2) Write to sensitive host paths from within a container (/etc, /proc/sysrq-trigger). 3) Docker socket access from within a container (fd opened matching /var/run/docker.sock). 4) Unexpected capability use (ptrace, mount syscall from non-init container). 5) Namespace escape indicators (setns syscall, clone with CLONE_NEWUSER from container). 6) Unexpected network connections from a container to host-only subnets. Alerts route to your SIEM via OpenClaw's webhook integration.
What is the most impactful single change to prevent container escapes?
Enforce Pod Security Standards at the cluster level. A single admission webhook enforcing the 'restricted' PSS profile prevents: privileged containers, hostPath mounts, hostPID/hostIPC/hostNetwork, missing seccomp profiles, privilege escalation, running as root. One OPA Gatekeeper or Kyverno policy set enforcing PSS 'restricted' eliminates the most common container escape vectors cluster-wide. This is more impactful than addressing individual vulnerabilities because it prevents entire classes of misconfiguration. Enable immediately on new clusters; use 'warn' mode on existing clusters first to identify violations, then switch to 'enforce'.
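The warn-then-enforce rollout can be sketched with namespace labels. Note that this example uses Kubernetes' built-in Pod Security Admission controller rather than the Gatekeeper or Kyverno policies mentioned above; the `pod-security.kubernetes.io/*` label keys are the ones that controller reads, while the helper names and the `production` namespace are illustrative assumptions.

```python
PSS_LEVELS = ("privileged", "baseline", "restricted")
LABEL_PREFIX = "pod-security.kubernetes.io"


def pss_namespace_labels(phase: str, level: str = "restricted") -> dict[str, str]:
    """Namespace labels for one rollout phase: 'warn' (report only) or 'enforce'."""
    if level not in PSS_LEVELS:
        raise ValueError(f"unknown Pod Security Standards level: {level}")
    # Both phases surface violations as client warnings and audit annotations.
    labels = {f"{LABEL_PREFIX}/warn": level, f"{LABEL_PREFIX}/audit": level}
    if phase == "enforce":
        labels[f"{LABEL_PREFIX}/enforce"] = level  # violating pods are rejected
    elif phase != "warn":
        raise ValueError(f"unknown phase: {phase}")
    return labels


def kubectl_label_command(namespace: str, labels: dict[str, str]) -> str:
    """Render the kubectl command that applies the labels to a namespace."""
    pairs = " ".join(f"{key}={value}" for key, value in sorted(labels.items()))
    return f"kubectl label --overwrite namespace {namespace} {pairs}"


# Phase 1 on an existing cluster: observe violations without blocking anything.
print(kubectl_label_command("production", pss_namespace_labels("warn")))
# Phase 2, once the violations are fixed: reject non-compliant pods.
print(kubectl_label_command("production", pss_namespace_labels("enforce")))
```

Enforcement applies to pods created after the label is set; pods already running in the namespace are not evicted, which is another reason to run the warn phase first.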