Tiny Container
Description
A lightweight container runtime and isolated code execution sandbox built directly on Linux primitives such as namespaces, cgroups, seccomp, and capabilities.
Overview
tinycontainer is built around three core concepts:
- Isolation using Linux namespaces and pivot_root
- Security using seccomp and Linux capabilities
- Resource Control using cgroup v2
The project uses a bootstrap process created inside new namespaces. Once the bootstrap process starts, it configures the isolated environment by preparing the root filesystem, mounts, cgroups, and security policies. Finally, it replaces itself with the target workload using syscall.Exec().
The project is divided into a few small components with clear ownership:
- API / CLI: Application entry points.
- Executor: Orchestrates code execution requests.
- Runtime: Manages isolated execution environments.
- Container: Creates and manages root filesystems.
- Workspace: Creates temporary directories for user code.
- Process: Handles namespace creation, bootstrap flow, and workload execution.
- Security: Applies privilege dropping, capabilities, and seccomp filters.
- Cgroup: Enforces resource limits.
Runtime Lifecycle
Each execution receives:
- A fresh workspace
- A fresh root filesystem
- Dedicated namespaces
- A dedicated cgroup
The runtime creates an isolated environment, launches a bootstrap process, applies resource limits and security policies, executes the workload, collects the output, and finally cleans up all temporary resources.
Flow:
- Request
- Create workspace
- Create container
- Launch bootstrap
- Apply Cgroups
- Execute program
- Collect output
- Cleanup
Process Bootstrap Flow
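In a multithreaded runtime such as Go's, namespaces are most reliably applied when a process is created, rather than by calling unshare on an already running process. Because of this, tinycontainer uses a re-exec pattern.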
The parent process launches a new instance of the current binary inside newly created namespaces. The child process detects bootstrap mode, configures the isolated environment, and finally replaces itself with the target workload using syscall.Exec().
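Because exec keeps the process ID, namespaces, cgroup membership, and any installed seccomp filter, this approach allows the same process to both prepare the sandbox and execute the untrusted code.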
Cgroup Handling
Resource limits are enforced using cgroup v2.
Currently supported limits include:
- Memory limits
- PID limits
Each execution receives its own cgroup, and the process is attached to that cgroup before execution begins. Resource enforcement is performed directly by the Linux kernel.
Seccomp Filtering
Seccomp is used to reduce the system call surface available to untrusted workloads.
After namespaces are configured and privileges are dropped, a seccomp filter is applied to the process.
The current implementation follows a deny-list approach, blocking a small set of dangerous system calls while allowing everything else. This keeps the implementation simple but is not ideal for production-grade sandboxing. A strict allow-list approach would provide significantly stronger isolation.
Seccomp acts as an additional security layer on top of namespaces, capabilities, and cgroups.
Execution Pipeline
The execution pipeline is intentionally simple:
- Receive execution request.
- Create workspace.
- Write source code.
- Create container root filesystem.
- Launch bootstrap process.
- Create namespaces.
- Apply security restrictions.
- Apply cgroup limits.
- Execute workload.
- Collect stdout and stderr.
- Clean up resources.
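The "collect stdout and stderr" step can be sketched with `os/exec` and two in-memory buffers. The `result` type and `runAndCollect` function are hypothetical names, and the example runs a plain `sh` command rather than a sandboxed workload.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
)

// result (hypothetical type) holds what an execution produced.
type result struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// runAndCollect runs a command to completion and captures both output
// streams separately, along with the exit code.
func runAndCollect(name string, args ...string) (result, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	res := result{Stdout: stdout.String(), Stderr: stderr.String()}

	// A non-zero exit is a normal outcome for user code, not a runtime failure.
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		res.ExitCode = exitErr.ExitCode()
		return res, nil
	}
	return res, err
}

func main() {
	res, err := runAndCollect("sh", "-c", "echo out; echo err >&2; exit 3")
	if err != nil {
		fmt.Println("failed to run:", err)
		return
	}
	fmt.Printf("stdout=%q stderr=%q exit=%d\n", res.Stdout, res.Stderr, res.ExitCode)
}
```

Keeping the two streams in separate buffers lets the caller report compiler or runtime errors without mixing them into the program's regular output.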
Check my blog for more details.
