Docker promises consistent environment setups and more through containerisation. Does it pair well with AEM? Learn when it is best to pair them and when not to.
"It worked on my machine!" As a backend developer, that phrase sounds all too familiar. Due to inconsistent environment setups among development teams and along the deployment chain, it is almost inevitable that software sometimes shows unexpected behaviour depending on where you run it.
Wouldn't it be nice to have a consistent setup from the developer's environment all the way to production? Also, wouldn't it be nice if onboarding new developers were as easy as writing one line in the terminal, with no tedious installation of language versions and configuration of the environment required?
A framework called Docker promises not only that, but also easier deployments and quick scaling when it comes to software delivery, by pushing the complexity of environment setup into a so-called container.
The concept of containerisation is not exactly new, but it has gained a lot of attention in recent years with the rapid development and improvement of the Docker framework. However, since not all software is created equal, especially when it comes to AEM, we need to evaluate whether AEM and Docker are a good fit before we sign the shipping papers.
Docker is a software container platform that allows you to run applications in an isolated environment with their own CPU, memory and network stack. While that might sound a lot like your standard virtual machine, there's a key difference between a VM and a container: the latter shares the kernel of the host operating system, while the former packs the overhead of an entire guest OS. Since it bundles nothing more than the libraries needed to run an application, a container is a lot lighter and instantiates faster. Thanks to the shared kernel, a container also doesn't need a hypervisor as an abstraction layer between itself and the host.
A Docker container is started from a so-called Docker image, which, simply put, is a snapshot of a self-contained file system. To build an image, you would usually create a text file called the Dockerfile and start by choosing an already existing image to base it on. This can, for example, be a certain flavour of Linux that provides the necessary libraries, or a development kit that you need for your application. You then define instructions in the Dockerfile to install dependencies and your application code, with each instruction creating its own immutable image layer. In the way it builds these layers, Docker is not very different from a version control system like Git: each layer is defined by the file difference to the preceding image, so its size is determined by that difference alone, and each layer is identified by a unique ID.
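To make this concrete, a minimal Dockerfile might look like the following sketch. The file names and the base image choice are illustrative, not taken from the article; each instruction produces one immutable layer:

```dockerfile
# Base layer: an existing image providing a Java development kit
FROM openjdk:8-jdk

# Each of the following instructions adds its own immutable layer,
# sized by the file difference it introduces
COPY myapp.jar /opt/myapp/myapp.jar
RUN chmod +x /opt/myapp/myapp.jar

# Default command executed when a container is started from this image
CMD ["java", "-jar", "/opt/myapp/myapp.jar"]
```

Running `docker build` on this file produces an image of three stacked layers on top of the base image, each addressable by its own ID.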
When you start a container from an image, a read/write container layer is created on top of it. This layer is discarded when the container is removed, unless you commit the changes as a new image layer.
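As a sketch, that lifecycle looks like this on the command line (image and container names are illustrative):

```shell
# Start a container; files written at runtime land in its read/write layer
docker run -d --name my-aem local/aem-author:6.3

# Optionally persist that layer as a new image instead of losing it
docker commit my-aem local/aem-author:6.3-with-content

# Removing the container discards its read/write layer
docker rm -f my-aem
```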
To get straight to the point: AEM isn't exactly the best kind of application to put in a Docker container. According to its homepage, Docker can "Build, Ship, and Run Any App, Anywhere". That promise shouldn't be trusted blindly, though. At its core, Docker is best suited for stateless microservices, not stateful monoliths like AEM.
AEM, with its modular OSGi-based design, is technically a collection of microservices, but seen as a self-contained application it is what you would call a monolith. Docker containers show their full potential when used with lightweight services that instantiate quickly. Only then can you properly benefit from the fast starting times of containers, compared to virtual machines that need to boot an entire OS first. But AEM usually takes minutes to start anyway, and a fresh installation without any content already uses over 540 MB of disk space before the first boot and 1.9 GB after the first instantiation (AEM 6.3). AEM in Docker can therefore profit neither from the light footprint nor from the scalability through fast instantiation that containers offer over virtual machines.
Everything is content. A core notion so important that it's probably the first thing every novice AEM developer gets taught. AEM as an application is tightly integrated with its underlying content repository, which in turn makes the application itself content. And that is exactly the reason why it is problematic to put AEM in a Docker container.
Unlike a virtual machine, a container can't be replaced without destroying all files the application created inside it. Of course, the container's state could be committed as a new image layer, but the main idea behind keeping containers free of persistent state is that deploying a new version becomes as easy as starting the new instance and deleting the old one, which in the case of AEM would also delete all the added content along with it.
There's another way in Docker to persist data, and that is through data volumes. They are initialised when a container is created and don't depend on the container's lifecycle, i.e. they persist even if the container is deleted. This also makes data volumes shareable and reusable among other containers.
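As a sketch, mounting the AEM repository onto a named volume could look like this. The image name is illustrative; `crx-quickstart/repository` is the standard location of the repository in a quickstart installation:

```shell
# Create a named volume that outlives any container
docker volume create aem-repo

# Mount it at the repository path inside the container; the volume
# survives even if this container is removed
docker run -d --name aem-author \
  -v aem-repo:/opt/aem/crx-quickstart/repository \
  local/aem-author:6.3
```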
However, this doesn't exactly solve our problem with AEM. If we move the repository to a persistent volume, we no longer lose our content. But it defeats the purpose of Docker, since old application code would persist in the volume as well.
Let's put aside for a moment that our preconditions for utilising Docker across the whole software lifecycle aren't optimal. How easy is it to get AEM to run inside Docker in the first place? Would it really give us an advantage over proven setups with a virtualisation solution like Vagrant? To evaluate that, let's look at two equivalent, very basic setups in which we start an unconfigured AEM author via the quickstart jar. For both options we'll perform the following steps:
(If you want to try the following code snippets yourself, please make sure to place the quickstart jar along with the license file in the same directory as your Vagrantfile and Dockerfile. For a more sophisticated setup with images for author, publish and load balancer instances, I recommend having a look at some of the public images on Docker Hub.)
To create our VM and start AEM, we simply run the command vagrant up in the directory where our Vagrantfile resides.
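A minimal Vagrantfile for this experiment could look roughly like the sketch below. The base box, port and inline provisioning script are assumptions for illustration, not the exact setup from the article:

```ruby
# Vagrantfile — minimal sketch for an unconfigured AEM author
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"                       # any Java-capable base box
  config.vm.network "forwarded_port", guest: 4502, host: 4502

  # Copy quickstart jar and license next to the Vagrantfile into the VM
  config.vm.provision "file", source: "aem-quickstart.jar",
                              destination: "/home/vagrant/aem-quickstart.jar"
  config.vm.provision "file", source: "license.properties",
                              destination: "/home/vagrant/license.properties"

  # Install Java and start AEM on first boot
  config.vm.provision "shell", inline: <<-SHELL
    yum install -y java-1.8.0-openjdk
    cd /home/vagrant && nohup java -jar aem-quickstart.jar &
  SHELL
end
```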
The script is called by simply adding RUN startAEM.sh to the Dockerfile after the line in which we unpack the jar. (This solution is, of course, not very elegant, but it serves its purpose for the sake of the argument.) After making this small addition, the Docker container had an average instantiation time of 53 seconds.
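The Dockerfile for this setup could be sketched as follows. The file names are placeholders, and startAEM.sh stands for the helper script described above; the trick is that running AEM once during the build bakes the initialised repository into an image layer:

```dockerfile
FROM openjdk:8-jdk

WORKDIR /opt/aem
COPY aem-quickstart.jar license.properties startAEM.sh ./

# Unpack the quickstart jar without starting AEM
RUN java -jar aem-quickstart.jar -unpack

# Start AEM once during the build, so the initialised repository
# becomes part of an image layer — this is what cuts the startup
# time of containers created from this image
RUN ./startAEM.sh

EXPOSE 4502
# The start script backgrounds AEM, so tail a log to keep the
# container's foreground process alive
CMD ["sh", "-c", "crx-quickstart/bin/start && tail -f crx-quickstart/logs/error.log"]
```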
As we can see, Docker really does give us a significant improvement when it comes to starting an AEM instance. However, this time saving comes at a price: the uninitialised image alone has a total size of 1.7 GB.
OK, so AEM is heavyweight and it's not possible to separate the application data completely from the content. But as we have seen, we get significant improvements with regard to instantiation time. Let's have a look at some different use cases and evaluate where we can benefit from Docker and where the stateless nature of containers is a showstopper.
The example above only installs a very basic AEM instance, but in real-world applications, a lot more server configuration is needed. To manage these configurations and to avoid snowflake servers, there are several ways to automate the provisioning process. In the Vagrant example, we used a shell script as a provisioner, but for more complex tasks you could choose a provisioning tool like Puppet, Chef or Ansible, to name three of the most popular ones. Docker works a bit differently in that regard, since technically the provisioning is done directly in the Dockerfile via shell commands.
In some cases, and I believe AEM is one of them, combining Docker with a dedicated provisioner is the better option. Thankfully, all the tools mentioned can also be used to configure Docker containers. So when it comes to configuration management, Docker is at least on par with virtualisation tools like Vagrant.
As mentioned earlier, the great benefit of using Docker in production would be the ability to tear down a container and replace it with a new application version with no significant configuration effort and zero downtime. But even with the data persisted on a Docker volume, you'd still have to deploy into a live volume to update the application there as well. That's the main reason why AEM in its current implementation shouldn't be put in Docker for production environments.
For a developer's local environment, live deployment into a volume is of course less risky, so Docker could be used to streamline the development process a bit. However, I don't consider the benefit significant, for two main reasons:
From personal experience, I can say that while Docker doesn't give me much benefit over the setups I usually work with, I enjoy having a disposable instance at hand that I can use, for example, to test Service Packs or different Java versions.
But there is actually a use case where we don't need to persist data in AEM and where instances are so short-lived that cleaning up created content might even be a benefit. In Continuous Integration pipelines, Docker can be used to run tests on feature branches. In this case, Docker is surprisingly efficient for AEM, since we can greatly benefit from the relatively fast instantiation and the incremental layer structure of a Docker image. You could, for example, have a basic Docker image set up with the code from the develop branch of your project. This initial image would be quite large, since it holds the entire application. However, you could then start a container for each branch that branches off develop and deploy only the code changes to the respective AEM instances running inside the containers.
As mentioned above, Docker image layers are read-only, which is why all feature branches can share the same base image. The additional disk space used by each container would therefore be roughly the size of the code changes in its branch. The branch-specific containers can then be used to run tests, and can easily be destroyed and cleaned up simply by being removed.
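A pipeline step following this pattern could be sketched like this; the image, container and branch names are hypothetical:

```shell
# Build the large base image once, from the develop branch
docker build -t myproject/aem-develop .

# Per feature branch: start a container from the shared base image;
# only the container's read/write layer consumes additional space
docker run -d --name aem-feature-xyz -p 4502:4502 myproject/aem-develop

# Deploy only the branch's code changes into the running instance
# (e.g. via the AEM package manager), then run the tests against it

# Tearing the instance down also cleans up all content it created
docker rm -f aem-feature-xyz
```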
So, are AEM and Docker a good match? At Netcentric, we have successfully been using Docker to test feature branches in some of our Continuous Delivery pipelines for quite a while now. This is where Docker excels together with AEM today. For use cases where the content of an instance mustn't be discarded (most notably production environments), I wouldn't recommend switching to Docker at the moment.
AEM in its current form just isn't meant to be put in a stateless container. However, there's a big emphasis on "current form". If future versions of AEM make it possible to use multiple content repositories, we will be able to separate generated content from application content. That way, all the stateless application data could run in the container while the generated content resides safely in a Docker volume. But until that happens, I recommend sticking to proven solutions for production and development servers.