As part of the work I am carrying out at INAF (the Italian National Institute for Astrophysics) and within the EU-funded ESCAPE project, I am developing a so-called science platform.
The aim of such platforms is to provide simplified access to computing and storage resources, and to make it easy (and reproducible) to run scientific codes on them.
The science platform we are developing at INAF (Rosetta), which is focused on resource-intensive data analysis, makes strong use of software containerisation to achieve this goal, and I therefore had the chance to take a deep dive into the entire container ecosystem.
The complexity is astonishing. And, as perhaps with any complex technology simplified enough to get mass adoption, a lot of details are hidden when you just type docker run hello-world.
The diagram below tries to summarise the situation as of today, and most importantly to clarify the relationships between the various moving parts.
Not all container orchestrators, engines and runtimes available as of today are included in this diagram or discussed in this article (most notably, Nabla, Shifter, Nomad and Marathon are not covered), but there should be enough of them to cover the various peculiarities and then generalise.
As this article will try to explain, in the transition from standalone, monolithic projects such as LXC and Docker to the Open Container Initiative (OCI) standards, the terminology got quite convoluted, and the same component can behave as two different ones depending on how it is framed (e.g. an engine can become a runtime of another engine).
A set of definitions is therefore definitely required to navigate the ecosystem.
If you are running single containers, you will interact with a container engine, which will in turn interact with a container runtime, either monolithically built-in or as a module. This leaves out the near-nonsense of engines that can behave as runtimes and that can thus be used from other engines.
If you are instead running a set of containers, you will use a container orchestrator. Which one to use will depend on the use case and deployment complexity. This holds if we leave out a new trend of building orchestrators on top of other orchestrators (e.g. Portainer), in which case the orchestrator will interact with... another orchestrator. Lastly, nothing prevents you from using an orchestrator to run a set of containers with a single element, thus effectively running a single container, which is something you might want to do in Cloud environments in particular (e.g. in AWS ECS or Fargate).
Docker is the container engine we all know. A monolithic project in the beginning, it has since been refactored to support both the needs of an open source ecosystem and architectural flexibility. At different stages, the Docker GitHub repository was renamed to Moby and the internal, built-in runtime was extracted as Containerd.
Indeed, for a long time Docker identified many things: a container engine, a runtime, a registry, an image format, a project and... a company. Getting refactored is a normal part of the life of software projects; however, with Docker and Kubernetes this generated a bit of confusion. As of today, the Docker Engine is to be understood as open source software for Linux, while Docker Desktop is to be understood as the freemium product of the Docker, Inc. company for Mac and Windows platforms. From Docker's product page: "Docker Desktop includes Docker Engine, Docker CLI client, Docker Build/BuildKit, Docker Compose, Docker Content Trust, Kubernetes, Docker Scan, and Credential Helper".
Podman is a daemonless container engine for developing, managing, and running OCI Containers on your Linux System. Containers can either be run as root or in rootless mode. Podman is a near drop-in replacement for the Docker engine which can run containers in userspace. On shared systems, it is probably the best possible tradeoff between usability and security, as it allows operating both as root and as a standard user. Unlike other userspace container solutions, Podman lets users become root inside the container even if they are standard users outside, which makes it extremely powerful.
Podman has a few issues with user ID (UID) management when running in rootless mode with UIDs close to 65536. For example, to allow the advanced package tool (APT) to work on Debian/Ubuntu-based containers, the group ID of its _apt user must be reassigned so that it does not clash with some forbidden ones [4], e.g.: groupadd -g 600 _apt and usermod -g 600 _apt. Moreover, by default the user outside the container is mapped to root inside the container, and non-root users are mapped to other UIDs [5]. Take-home message: a terrific piece of software, but beware of UIDs.
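To make the UID issue more concrete, here is a minimal sketch of how a rootless user namespace typically maps UIDs: root inside the container is the invoking user outside, and the other UIDs are taken in order from that user's subordinate range (as listed in /etc/subuid). The function name and the values used (host UID 1000, a range of 65536 IDs starting at 100000) are assumptions for illustration and are not part of Podman; the actual range depends on how the host is configured.

```python
from typing import Optional


def map_container_uid(container_uid: int, host_uid: int,
                      subuid_start: int, subuid_count: int) -> Optional[int]:
    """Return the host UID backing a container UID, or None if it is unmapped."""
    if container_uid == 0:
        # Root inside the container is the unprivileged user outside.
        return host_uid
    if 1 <= container_uid <= subuid_count:
        # The remaining UIDs come, in order, from the subordinate range.
        return subuid_start + container_uid - 1
    # Anything beyond the subordinate range has no counterpart on the host.
    return None


# Hypothetical host configuration (assumed values, check /etc/subuid).
HOST_UID = 1000
SUBUID_START, SUBUID_COUNT = 100000, 65536

for uid in (0, 1, 100, 65534, 65536, 65537):
    mapped = map_container_uid(uid, HOST_UID, SUBUID_START, SUBUID_COUNT)
    print(f"container UID {uid} -> host UID {mapped}")
```

With a range of this size, IDs close to 65536 sit right at the edge of what can be mapped, which is why users or groups with very high IDs are the ones that tend to cause trouble.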
Singularity (now Apptainer) should be thought of more as a virtual environment on steroids than as a container engine. Indeed, it does not enforce (or even permit) robust isolation between the containers and the host, leaving large portions exposed. This is not only a security issue but, most importantly, it makes the container behaviour susceptible to being affected by external factors. In the shipping container analogy, you can think about Singularity containers as if they had no walls.
In more detail, by default directories such as /home, /tmp, /proc, /sys, and /dev are all shared with the host, environment variables are exported as they are set on the host, the PID namespace is not created from scratch, and the network and sockets are shared with the host as well. Moreover, Singularity maps the user outside the container to the same user inside it, meaning that every time a container is run the user UID (and name) can change inside it, making it very hard to handle permissions.
Two issues opened on the former Singularity project are quite self-explanatory: "Same container, different results" and "Python3 script fails in singularity container on one machine, but works in same container on another". In both cases the issue was due to lack of isolation between the container and the host.
Containerd, which will be introduced in the runtimes section, is not intended to be directly used as an engine (being a runtime), but with the Containerd CLI (ctr) utility it can behave as such. If you are curious, here is a primer for how to use it. I included it in the list for completeness and to show how mutable definitions can be in the container ecosystem, as for CRI-O with crictl below.
CRI-O, which will be introduced in the runtimes section as well, is not intended to be directly used as an engine either. However, with the crictl command line utility it can behave as such, mainly for debugging purposes. To underline that CRI-O is not intended to be directly used from a command line (being a runtime), the official CRI-O code repository states that "any CLIs built as part of this project are only meant for testing this project and there will be no guarantees on the backward compatibility with it". In any case, here is a tutorial for running a Redis service using CRI-O with crictl if you are curious.
LXD is something tangential to a container engine, as it allows managing both containers and virtual machines, offering "a unified user experience around full Linux systems running inside containers or virtual machines". LXD uses LXC as its internal runtime. Rocket (rkt) was instead a command line utility for running containers on Linux directly using kernel-level calls, similarly to LXC, and is now a discontinued project.
Containerd is a high-level container runtime that originated from Docker and was extracted out of Docker itself for flexibility over the years. A default Docker engine installation will install Containerd as well. Containerd is also the default Kubernetes CRI runtime. Containerd uses runC as its default low-level runtime.
CRI-O is an "implementation of the Kubernetes CRI (Container Runtime Interface) to enable using OCI (Open Container Initiative) compatible runtimes" [6]. It basically tried to fill some gaps during the development of Kubernetes and is now a direct competitor (if it makes sense to call it such) to Containerd. CRI-O uses runC as its default low-level runtime as well.
RunC is an OCI-compatible container runtime. It implements the OCI specification and runs the container processes. RunC is called the reference implementation of OCI [7].
gVisor is a runtime developed by Google which implements kernel virtualisation. In other words, each container has its own kernel, unlike other container runtimes where the kernel is usually shared between the host and the containers. It provides more security than other runtimes while still allowing host resources to be shared without pre-allocation.
Kata Containers is a runtime implementing hardware virtualisation (aka: a virtual machine). The idea is to have a runtime which behaves as if it were running software containers but that under the hood spawns a new virtual machine and runs the container inside it. It is very interesting in terms of security and hardware emulation for multi-architecture tests. On the cons side, it requires pre-allocation of resources, and in particular of the memory, which is set by default to 2GB per container.
Docker Compose allows defining simple multi-service applications where all the containers run on the same node. It creates a dedicated network for the containers on the host, over which they can all talk to each other, and a docker-compose.yml file describes how to assemble them. It is the simplest orchestrator, and very useful for local and simple deployments. Docker Compose has support only for the Docker APIs, and Podman can work with it by emulating Docker.
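To give an idea of what such a file looks like, the following sketch renders a minimal docker-compose.yml for a two-service application. The render_compose helper and the service names (web, cache) are hypothetical and chosen only for illustration; a real file would usually also declare ports, volumes and environment variables.

```python
def render_compose(services: dict) -> str:
    """Render a minimal Compose file from a {service name: image} mapping."""
    lines = ["services:"]
    for name, image in services.items():
        lines.append(f"  {name}:")
        lines.append(f"    image: {image}")
    return "\n".join(lines) + "\n"


# Hypothetical two-service application: a web server and a cache.
compose_text = render_compose({
    "web": "nginx:alpine",
    "cache": "redis:alpine",
})
print(compose_text)
```

Saved as docker-compose.yml, the printed text could be started with docker compose up; on the dedicated network, each container can then reach the other one using its service name as hostname.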
Docker Swarm is similar to Docker Compose but it can manage multi-node deployments, or in other words a cluster of Docker engines called a "swarm". As for Docker Compose, Docker Swarm supports only the Docker APIs. It is an in-between solution, but still very useful for small teams where using managed orchestrators is not possible and configuring Kubernetes would require too much effort.
Kubernetes is the full-featured solution for container orchestration, supporting a variety of settings, network topologies and container runtimes. In 2021 it dropped support for Docker, which generated some panic over the internet. What actually happened is that it dropped support for Dockershim in favour of directly using Containerd, and nothing changed for the users. Kubernetes adds the notion of "pod" to the container ecosystem, and can support multiple container runtimes by defining pods with different settings. Mastering Kubernetes is hard, and even an entry-level setup can take time. Kubernetes can be accessed both using a CLI and a set of REST APIs.
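As a sketch of how pods with different settings can select different runtimes, the snippet below builds two Kubernetes manifests as plain dictionaries: a RuntimeClass and a pod that references it through the runtimeClassName field. The helper functions, the names (gvisor, sandboxed-nginx) and the handler value runsc are assumptions for illustration: the handler has to match whatever runtime is actually configured on the cluster nodes.

```python
import json
from typing import Optional


def runtime_class(name: str, handler: str) -> dict:
    """Build a RuntimeClass manifest pointing at a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def pod(name: str, image: str, runtime_class_name: Optional[str] = None) -> dict:
    """Build a single-container pod manifest, optionally bound to a RuntimeClass."""
    spec = {"containers": [{"name": name, "image": image}]}
    if runtime_class_name is not None:
        # Without this field the pod runs with the default runtime of the node.
        spec["runtimeClassName"] = runtime_class_name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical names; "runsc" is assumed to be the handler configured for gVisor.
gvisor_class = runtime_class("gvisor", "runsc")
sandboxed_pod = pod("sandboxed-nginx", "nginx:alpine", runtime_class_name="gvisor")
default_pod = pod("plain-nginx", "nginx:alpine")

print(json.dumps(gvisor_class, indent=2))
print(json.dumps(sandboxed_pod, indent=2))
```

Since kubectl accepts JSON manifests as well as YAML ones, the printed documents could be saved to a file and applied as they are.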
Amazon Web Services' Elastic Container Service (ECS) is Amazon's internal implementation of a Kubernetes-like solution. Amazon virtual machines require the Docker Engine to be installed in order to be managed using ECS [9]. Alternatively, customers can directly use a virtual machine image pre-built by Amazon which is already configured for use with ECS. The main point is that AWS ECS uses the Docker engine, and not a container runtime.
AWS Fargate is likely one of those "definitive" solutions that will become the new normal for a large number of use cases (as happened with RDS). Fargate allows executing containers in a serverless fashion, on AWS computing infrastructure, and entirely forgetting about the underlying OS (and hardware, of course). Interestingly enough, probably because it is a younger project than ECS, it could make the strategic move of no longer relying on container engines in favour of adopting container runtimes. In particular, with Fargate platform version 1.4 in April 2020, they replaced the Docker Engine with Containerd as Fargate’s container execution engine [10].
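The relationships described so far can be summarised as a small lookup table, where each component points to the component it delegates to by default. This is only a sketch of the defaults as presented in this article: the DEFAULT_BACKEND table and the resolve_chain function are illustrative names, and real deployments can be configured differently.

```python
# Default delegation between components, as described in this article.
DEFAULT_BACKEND = {
    "Docker Compose": "Docker Engine",
    "Docker Swarm": "Docker Engine",
    "AWS ECS": "Docker Engine",
    "AWS Fargate": "Containerd",
    "Kubernetes": "Containerd",
    "Docker Engine": "Containerd",
    "LXD": "LXC",
    "Containerd": "runC",
    "CRI-O": "runC",
}


def resolve_chain(component: str) -> list:
    """Follow the default delegation from a component down to its last runtime."""
    chain = [component]
    while chain[-1] in DEFAULT_BACKEND:
        chain.append(DEFAULT_BACKEND[chain[-1]])
    return chain


for start in ("AWS ECS", "AWS Fargate", "Kubernetes", "LXD"):
    print(" -> ".join(resolve_chain(start)))
```

Seen this way, the difference between ECS and Fargate is simply one extra hop: ECS goes through the Docker Engine before reaching Containerd, while Fargate talks to Containerd directly.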
The container ecosystem is moving fast. After Docker, intended both as a company and as a technology, enabled mass adoption of containerisation back in 2013, a lot changed, in particular over the last few years.
The need to decouple the internal components of early container engines emerged only when container orchestrators started to require more flexibility in how to run containers (as happened for Kubernetes and for Docker itself, from which the Containerd runtime stemmed).
The Open Container Initiative, born along the way, is trying to provide standard and well-defined formats and interfaces; however, the entanglement of Docker with other technologies and services is still very strong and causes confusion.
Newer or well-maintained projects, such as Kubernetes or Amazon Fargate, have it easier from this perspective, since they can just stop supporting Docker as an engine and move altogether to container runtimes (such as the Docker-derived Containerd, which provides strong backward compatibility in the transition). This approach also makes it easy to plug in other runtimes, and thus to support more usage scenarios, for example improving security with kernel-level or hardware-level virtualisation using the gVisor or Kata runtimes.
As a general comment, we will probably still have to live with this confusion for a while, but the path is set. I hope that this article will be useful for anyone who finds themselves lost in the complexity of today's container ecosystem, and in particular for contextualising some technical details which are there more for historical and refactoring reasons than because of explicit architectural choices.
p.s. Have I missed something? Any feedback is welcome! My contact details are in the footer.
I would like to thank Giuliano Taffoni for all the discussions we had around software containerisation, and John Swinbank for his feedback. I would also like to thank Alessandro Angioi for his suggestions around Docker Swarm. Lastly, I would like to thank the ESCAPE project (Horizon 2020 Grant Agreement no. 824064) for funding my work, from which this article stemmed out.