How I Built tinycontainer: A Lightweight Container Runtime in Go

Posted on: 29 May 2026 | Last updated: 29 May 2026 | Read: 13 min | In: Linux, Go

Description

A deep dive into building a lightweight container runtime and code execution sandbox using Linux namespaces, cgroups, seccomp, capabilities, and process isolation. This article covers the architecture, challenges, and lessons learned while exploring how containerization works under the hood.

Full blog

Background

Back in college, I built an online code execution platform that accepted user-submitted code, ran it inside Docker containers (with some security measures), and returned the output. Once that project was finished, I had an itch to build a container runtime myself. I knew containers were built using Linux primitives such as namespaces and cgroups, so I decided to see how much I could build on my own.

Since I was learning Rust at the time, I decided to experiment with those concepts and built a small project called rusty-sandbox.

The project mostly failed. I was able to launch processes with basic namespace isolation, but the resource limits I implemented did not work properly. A process could still consume excessive resources and impact the host system. Eventually, I paused the project to focus on other things, including LeetCode.

In May 2026, I revisited the idea during a long holiday. This time I built a new version in Go called tinycontainer. Unlike my earlier attempt, this version has working process isolation and resource limits, built on namespaces, cgroups, capabilities, seccomp, and filesystem isolation.

It is still an educational project, not a production-ready container runtime. However, it is complete enough to execute code in an isolated environment (with resource limits), and it gave me a much deeper understanding of how containers actually work.

One interesting thing about containers is that there isn't a single, universally accepted definition of one. Linux itself doesn't provide a dedicated container feature. What we call a container is usually a combination of namespaces, cgroups, filesystem isolation, and process management.

This is also why container runtimes can differ significantly while still being considered container runtimes. Containers are less of a specific Linux feature and more of an abstraction built on top of several existing kernel primitives.

In this article, I'll walk through that journey and cover:

  1. Why the Rust attempt failed
  2. What concepts I used
  3. Why the re-exec pattern confused me
  4. Mistakes I made
  5. Things I'd change

Key concepts

I won't go into exhaustive detail about every concept here, but I'll cover enough to explain how they fit together and how they helped me build tinycontainer. If you want to dive deeper, I highly recommend reading the relevant man pages and the resources listed at the end.

tinycontainer is built around three core concepts:

  1. Isolation using Linux namespaces and pivot_root
  2. Security using seccomp and Linux capabilities
  3. Resource Control using cgroup v2

Namespaces

Namespaces are one of the fundamental building blocks of containerization. They allow a process to have its own view of certain parts of the system without affecting the host or other processes.

A simple way to think about a namespace is as a private view of a system resource. Depending on the namespace type, a process may see a different set of processes, mount points, hostnames, users, or network interfaces.

In other words:

Namespaces control what a process can see and interact with.

They provide strong isolation, but they are not a complete security boundary by themselves. Container runtimes typically combine namespaces with additional mechanisms such as seccomp, capabilities, and cgroups.

Linux provides several namespace types, and container runtimes commonly make use of most of them.

Creating a New Namespace

The easiest way to experiment with namespaces is through the unshare command.

unshare follows an opt-in model. Unless you explicitly request a namespace, the process continues to use the host's namespace.

For example:

sudo unshare --pid --mount --fork --uts /bin/bash

This command starts a new shell with new PID, mount, and UTS namespaces.

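The --fork flag is important because unshare itself does not move into the new PID namespace; only the children it creates afterwards do. With --fork, unshare forks the shell as a child, which makes the shell the first process in the new namespace. Once inside the shell, run echo $$. You should see 1.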

The shell becomes PID 1 inside the new PID namespace, even though it has a different PID from the host's perspective.

Cgroups

Control groups, or cgroups, are another important part of this project. While namespaces provide isolation, cgroups provide resource control.

A process running inside its own namespace can still consume excessive CPU or memory on the host. Namespaces change what a process can see, but they do not limit how much of a resource it can use.

Cgroups allow us to enforce limits on resources such as:

  • Memory usage
  • CPU usage
  • The number of processes a program can create (PIDs)

By placing a process inside a cgroup, we can ensure that it stays within defined resource limits.

Filesystem isolation

In tinycontainer, I use pivot_root for filesystem isolation. However, since pivot_root is a bit harder to experiment with from the terminal, we'll use chroot to understand the basic idea first.

The two are not the same. chroot simply changes the apparent root directory for a process and its children, while container runtimes typically use pivot_root together with mount namespaces to create a more complete filesystem environment. For learning purposes, though, chroot is much easier to demonstrate.

Let's start by creating a minimal Debian filesystem:

mkdir test
sudo debootstrap --variant=minbase stable ./test http://deb.debian.org/debian

Once that's done, run:

sudo chroot test /bin/bash

This starts a new shell whose root directory is test.

From inside that shell:

/      -> host's ./test
/etc   -> host's ./test/etc

In other words, when the shell accesses /, it is actually accessing the test directory on the host.

To verify this, run:

mkdir /testing

Inside the chroot environment, this appears to create:

/testing

However, from the host's perspective, the directory is actually created at:

test/testing

This happens because the shell now treats test as its root directory.

It's worth noting that chroot alone is not a complete isolation mechanism. It changes a process's view of the filesystem, but it does not create a new mount namespace or provide the same guarantees that container runtimes typically rely on.

Even so, changing the root is really useful because Linux is heavily filesystem-oriented. Once we set up a root filesystem for the process, most of the files it needs can come from that environment instead of directly from the host filesystem.
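For comparison with the chroot demo, here is a rough sketch of the usual pivot_root sequence in Go. It is an assumption about how such a step is commonly written, not a copy of tinycontainer's code: the function names and the ".old_root" directory name are arbitrary. It needs root, and it only touches a private mount namespace that it creates first.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"syscall"
)

// oldRootDir is where the host's root is parked during the switch. The
// name ".old_root" is an arbitrary choice for this sketch.
func oldRootDir(newRoot string) string {
	return filepath.Join(newRoot, ".old_root")
}

// pivotInto makes newRoot the root filesystem of the calling thread's
// mount namespace. Call it only inside a fresh mount namespace.
func pivotInto(newRoot string) error {
	// Stop mount changes from propagating back to the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	// pivot_root requires newRoot to be a mount point, so bind-mount it
	// onto itself.
	if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount: %w", err)
	}
	putOld := oldRootDir(newRoot)
	if err := os.MkdirAll(putOld, 0o700); err != nil {
		return err
	}
	if err := syscall.PivotRoot(newRoot, putOld); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// The old root is now visible at /.old_root. Detach it so the host
	// filesystem is no longer reachable from inside.
	if err := syscall.Unmount("/.old_root", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Remove("/.old_root")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pivot <new-root>")
		return
	}
	// Namespace membership is per-thread, so pin this goroutine to its
	// thread and give that thread a private mount namespace first.
	runtime.LockOSThread()
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
		fmt.Fprintln(os.Stderr, "unshare:", err)
		return
	}
	if err := pivotInto(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	entries, _ := os.ReadDir("/")
	for _, e := range entries {
		fmt.Println(e.Name()) // now lists the new root, not the host's
	}
}
```

Unlike chroot, the old root is unmounted at the end, so there is no path left that leads back to the host filesystem.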

Capabilities

Traditionally, Linux followed a simple model: a process was either running as root (UID 0) or it wasn't. Root processes were granted broad privileges, while non-root processes were subject to normal permission checks.

Linux capabilities split many of those root privileges into smaller, independent permissions. Instead of giving a process unrestricted root access, we can grant or remove specific capabilities depending on what it needs to do.

For a code execution sandbox, this is useful because most programs do not need powerful capabilities such as:

  • CAP_SYS_ADMIN
  • CAP_SYS_BOOT
  • CAP_SYS_MODULE

Removing these capabilities reduces the number of privileged operations a process can perform and limits its ability to affect the host system.

In tinycontainer, I drop a number of capabilities before executing user code. This is an additional security layer on top of namespaces and filesystem isolation.

It's worth noting that my implementation takes an opt-out approach: I start with the default capability set and remove capabilities that I consider dangerous. A more secure approach would be opt-in, where all capabilities are dropped first and only the ones strictly required by the workload are added back.

Capabilities are an extremely powerful Linux feature, and I am only using a small subset of what they can do. Properly managing capabilities can become surprisingly complex, which is one reason production container runtimes spend a lot of effort getting this right.
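As an illustration of the opt-out approach, the sketch below removes the three capabilities listed earlier from the capability bounding set using prctl(2). It is a simplified assumption of how this can be done with only the standard library, not tinycontainer's implementation: the constant and function names are made up, and the numeric values are copied from the kernel headers. A fuller version would also clear the permitted, effective, and inheritable sets with capset(2).

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// Values from <linux/prctl.h> and <linux/capability.h>, spelled out here
// so the sketch does not depend on any package exporting them.
const (
	prCapbsetRead = 23
	prCapbsetDrop = 24

	capSysModule = 16
	capSysAdmin  = 21
	capSysBoot   = 22
)

// denylist follows the opt-out approach: keep the defaults and remove what
// looks dangerous. The exact list is an assumption for illustration.
var denylist = []uintptr{capSysAdmin, capSysBoot, capSysModule}

// inBoundingSet reports whether capability c is still in the calling
// thread's bounding set. Reading needs no privileges.
func inBoundingSet(c uintptr) (bool, error) {
	r, _, errno := syscall.RawSyscall(syscall.SYS_PRCTL, prCapbsetRead, c, 0)
	if errno != 0 {
		return false, errno
	}
	return r == 1, nil
}

// dropFromBoundingSet removes each capability from the calling thread's
// bounding set, which limits what a program exec'd from this thread can
// gain. The caller needs CAP_SETPCAP.
func dropFromBoundingSet(caps []uintptr) error {
	for _, c := range caps {
		if _, _, errno := syscall.RawSyscall(syscall.SYS_PRCTL, prCapbsetDrop, c, 0); errno != 0 {
			return fmt.Errorf("drop capability %d: %w", c, errno)
		}
	}
	return nil
}

func main() {
	// Capability sets are per-thread, so stay on one OS thread between
	// dropping and the eventual exec of the user program.
	runtime.LockOSThread()
	if err := dropFromBoundingSet(denylist); err != nil {
		fmt.Fprintln(os.Stderr, err, "(needs CAP_SETPCAP, for example root)")
	}
	for _, c := range denylist {
		ok, err := inBoundingSet(c)
		fmt.Printf("capability %d in bounding set: %v (err: %v)\n", c, ok, err)
	}
}
```

An opt-in version would loop over every capability number and keep only an explicit allowlist, rather than naming the ones to remove.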

Seccomp

Seccomp (Secure Computing Mode) is a Linux kernel feature that allows us to restrict which system calls a process can make. Since processes interact with the kernel through system calls, reducing the available syscalls also reduces the attack surface.

A simple way to think about seccomp is as a gatekeeper between the process and the kernel. Whenever a process requests an operation from the kernel, the seccomp filter decides whether that syscall should be allowed or blocked.

Unlike capabilities, which control privileged operations, seccomp controls which syscalls can be used. Modern seccomp filters can also inspect syscall arguments, allowing more fine-grained policies.

Like capabilities, I'm not fully utilizing seccomp in this project. The implementation follows an opt-out approach, which isn't ideal, but it provides a reasonable starting point for an educational project.

Production container runtimes often combine seccomp with other security mechanisms such as AppArmor or SELinux for additional protection.
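To make the opt-out idea concrete, here is a small sketch of a denylist seccomp filter written as a classic BPF program, using only Go's standard library. Everything about it is an assumption for illustration: the helper names, the choice of denied syscalls, and the amd64-only architecture check. It is not tinycontainer's profile, and real projects usually rely on libseccomp bindings rather than hand-written BPF.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

// Constants from <linux/prctl.h>, <linux/seccomp.h>, and <linux/audit.h>.
const (
	prSetSeccomp      = 22
	prSetNoNewPrivs   = 38
	seccompModeFilter = 2

	retAllow       = 0x7fff0000
	retErrno       = 0x00050000 // OR'd with the errno to return
	retKillProcess = 0x80000000
	auditArchX8664 = 0xc000003e
)

func stmt(code uint16, k uint32) syscall.SockFilter {
	return syscall.SockFilter{Code: code, K: k}
}

func jeq(k uint32, jt, jf uint8) syscall.SockFilter {
	return syscall.SockFilter{Code: syscall.BPF_JMP | syscall.BPF_JEQ | syscall.BPF_K, Jt: jt, Jf: jf, K: k}
}

// denyFilter builds a program that kills the process on an unexpected
// architecture, returns EPERM for each denied syscall number, and allows
// everything else.
func denyFilter(arch uint32, denied []uint32) []syscall.SockFilter {
	ld := uint16(syscall.BPF_LD | syscall.BPF_W | syscall.BPF_ABS)
	ret := uint16(syscall.BPF_RET | syscall.BPF_K)
	prog := []syscall.SockFilter{
		stmt(ld, 4),     // load seccomp_data.arch
		jeq(arch, 1, 0), // expected arch: skip the kill
		stmt(ret, retKillProcess),
		stmt(ld, 0), // load seccomp_data.nr
	}
	for _, nr := range denied {
		prog = append(prog,
			jeq(nr, 0, 1), // match: fall through to the errno return
			stmt(ret, retErrno|uint32(syscall.EPERM)),
		)
	}
	return append(prog, stmt(ret, retAllow))
}

// install applies the filter to the calling OS thread only. A runtime
// would do this right before exec'ing the user program from that thread.
func install(prog []syscall.SockFilter) error {
	if _, _, e := syscall.RawSyscall(syscall.SYS_PRCTL, prSetNoNewPrivs, 1, 0); e != 0 {
		return e
	}
	fprog := syscall.SockFprog{Len: uint16(len(prog)), Filter: &prog[0]}
	if _, _, e := syscall.RawSyscall(syscall.SYS_PRCTL, prSetSeccomp, seccompModeFilter, uintptr(unsafe.Pointer(&fprog))); e != 0 {
		return e
	}
	return nil
}

func main() {
	if runtime.GOARCH != "amd64" {
		fmt.Println("this sketch only covers amd64")
		return
	}
	runtime.LockOSThread() // keep the filter and the test call on one thread
	denied := []uint32{syscall.SYS_UNSHARE, syscall.SYS_MOUNT, syscall.SYS_REBOOT}
	if err := install(denyFilter(auditArchX8664, denied)); err != nil {
		fmt.Println("install failed:", err)
		return
	}
	// unshare(0) is normally a harmless no-op; with the filter installed
	// it should fail with EPERM.
	fmt.Println("unshare after filter:", syscall.Unshare(0))
}
```

A denylist like this is easy to write but easy to get wrong, since any syscall that is not listed stays available. An allowlist inverts the default: the final return becomes an error and each permitted syscall gets its own allow rule.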

Re-exec

Re-exec confused me a lot. I understood the idea, but figuring out how to use it in a container runtime took me a while. After experimenting with a few commands and watching this video, it finally made sense.

The core idea is simple: a process replaces itself by executing another program (or even the same program again). Unlike fork, which creates a new process, exec keeps the same PID and replaces the process image.

You can see this in action:

  1. Run echo $$ in your shell.
  2. Run exec /bin/bash.
  3. Run echo $$ again.

The PID should remain the same.

Container runtimes often use re-exec as part of setting up namespaces, mounts, cgroups, and other isolation features. Instead of continuing execution from the current state, the runtime starts itself again from a known entry point inside the newly created environment, finishes the setup there, and then execs the user's program.

In my project, the flow looks roughly like this:

Start process
Re-exec into bootstrap mode
Setup isolation
Exec user code

To avoid an infinite loop, I pass an extra argument that tells the program which phase it is currently running in.
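The flow above can be sketched in a few lines of Go. This is a stripped-down illustration rather than tinycontainer's real entry point: the "__bootstrap" marker and the function names are invented, and the isolation setup is left as a comment so the example runs without root.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// bootstrapArg marks the second phase. The literal value is a made-up
// convention for this sketch.
const bootstrapArg = "__bootstrap"

// inBootstrap reports whether this invocation is the re-exec'd copy.
func inBootstrap(args []string) bool {
	return len(args) > 1 && args[1] == bootstrapArg
}

// launch is phase one: start a copy of this binary in bootstrap mode.
func launch(userCmd []string) error {
	cmd := exec.Command("/proc/self/exe", append([]string{bootstrapArg}, userCmd...)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// A real runtime would also set cmd.SysProcAttr.Cloneflags here so the
	// copy starts inside new namespaces.
	return cmd.Run()
}

// bootstrap is phase two: set up isolation, then become the user program.
func bootstrap(userCmd []string) error {
	// Isolation setup (mounts, capabilities, seccomp) would go here.
	path, err := exec.LookPath(userCmd[0])
	if err != nil {
		return err
	}
	// syscall.Exec replaces this process image and keeps the same PID.
	return syscall.Exec(path, userCmd, os.Environ())
}

func main() {
	var err error
	switch {
	case inBootstrap(os.Args) && len(os.Args) > 2:
		err = bootstrap(os.Args[2:])
	case !inBootstrap(os.Args) && len(os.Args) > 1:
		err = launch(os.Args[1:])
	default:
		fmt.Fprintln(os.Stderr, "usage: reexec <command> [args...]")
		return
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

Built as a binary named reexec (a hypothetical name), running ./reexec /bin/echo hello goes through both phases: the first invocation re-executes itself with the marker argument, and the second one replaces itself with /bin/echo. The marker is what keeps the second invocation from launching a third.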

Mistakes I Made / Why the Rust Version Didn't Work

I'm generally a learn-by-doing type of person. Most of the time that works well for me. I usually pick a language or framework, start building something, and learn along the way.

That approach did not work particularly well for this project.

The biggest reason the Rust version failed was that I jumped into implementation before I fully understood the underlying concepts. I knew about namespaces, cgroups, and process isolation at a high level, but I hadn't spent enough time understanding how they fit together before writing code.

The second mistake was choosing Rust while I was still learning Rust. This wasn't Rust's fault. The language introduces concepts such as ownership and borrowing that require a different way of thinking. Instead of learning Linux internals and Rust at the same time, I probably should have reduced the number of unknowns and focused on one problem at a time.

For many of my previous projects, I could rely on strict linting, good documentation, and incremental learning. That approach usually worked. For a systems project like this, I found that spending more time reading documentation, man pages, and existing implementations upfront paid off much more than immediately writing code.

Looking back, I don't consider the Rust version a waste of time. If anything, it taught me exactly what I was missing. Most of the mistakes I corrected in tinycontainer were mistakes I first made in rusty-sandbox.

If I decide to seriously learn Rust again, I'll give it the time and attention it deserves rather than trying to learn both the language and a complex systems topic at the same time.

Things I'd Change

There are still plenty of areas where this project could be improved.

From a security perspective, I would spend more time:

  1. Hardening the seccomp profile
  2. Taking a stricter approach to capabilities
  3. Exploring additional security layers such as AppArmor

Beyond security, there are a few practical improvements I'd like to make as well.

One thing I removed early in development was an interactive runtime. The project originally supported both an interactive environment and code execution, but I eventually removed the interactive mode to keep the scope manageable. Looking back, I would probably keep it because it makes testing and experimentation much easier.

I'd also like to improve the testing story. This project is heavily tied to Linux primitives such as namespaces, mounts, and process execution, which makes traditional unit testing difficult. Most of the interesting behavior only exists once the process is actually running inside the isolated environment.

Related to that, I would spend more time making the project easier to test in general. In most software projects, I try to design components so they can be tested independently. That was much harder here because many parts of the runtime depend on each other. Namespaces, filesystem setup, cgroups, capabilities, and process execution are all tightly connected, which makes isolation and testing more challenging.

Finally, I would like to support more languages and frameworks. The runtime itself is language-agnostic, so expanding language support would mainly be a matter of building and maintaining additional root filesystems.

Closing Thoughts

I really enjoyed building this project.

From a purely career-oriented perspective, I could have spent the time grinding LeetCode or building another application project. Instead, I chose to spend it learning how containers work under the hood. I don't regret that decision at all.

This project scratched an itch I had for a long time. More importantly, it gave me a much better understanding of the Linux primitives that power modern container runtimes. Many of the commands and concepts that previously felt like magic now make a lot more sense.

It also increased my appreciation for Linux and the people who build and maintain the software we rely on every day.

Whenever I work on niche projects like this, I end up reading old blog posts, watching conference talks, digging through source code, and learning from comments written years ago by people I've never met. There is something special about being able to learn from others across time in that way.

So I'd like to end by thanking the people who take the time to teach, write documentation, answer questions, and share what they know. Projects like this would have been significantly harder without them.

References

Additional references and resources will be added over time.