AEM and Docker - Are They a Good Fit?

Docker promises consistent environment setups and more through containerisation. Does it pair well with AEM? Learn when it's best to pair them and when it isn't.

"It worked on my machine!" As a backend developer, that phrase sounds all too familiar. Due to inconsistent environment setups among development teams and along the deployment chain, it is almost inevitable that software sometimes shows unexpected behaviour depending on where you run it.

Wouldn't it be nice to have a consistent setup from the developer's environment all the way to production? Also, wouldn't it be nice if onboarding new developers were as easy as writing one line in the terminal, with no tedious installation of language versions and configuration of the environment required?

A framework called Docker promises not only that, but also easier deployments and quick scaling when it comes to software delivery, by pushing the complexity of environment setup into a so-called container.

The concept of containerisation is not exactly new, but it has gained a lot of attention in recent years thanks to the rapid development and improvement of the Docker framework. However, since not all software is created equal, especially when it comes to AEM, we need to evaluate whether AEM and Docker are a good fit before we sign the shipping papers.

What is Docker?

Docker is a software container platform that allows you to run applications in an isolated environment with their own CPU, memory and network stack. While that might sound a lot like a standard virtual machine, there is a key difference between a VM and a container: the latter shares the kernel of the host operating system, while the former packs the overhead of an entire guest OS. Since it bundles nothing more than the libraries needed to run the software, a container is a lot lighter and instantiates faster. Thanks to the shared kernel, a container also doesn't need a hypervisor as an abstraction layer between itself and the host.

(Diagram: a virtual machine stack compared to a Docker container stack)
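One way to see the shared kernel in action: on a Linux host, any container reports the host's kernel version, because it has no kernel of its own (the alpine image below is just an arbitrary small example).

docker run --rm alpine uname -r
# prints the kernel release of the host, not of a guest OS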

Apart from the performance benefits mentioned above, one of Docker's biggest selling points is the portability of its containers. They run virtually anywhere (pun intended), and they do so very consistently: if a Docker container runs on one machine, it will run the same on every machine capable of running the Docker engine, with little to no configuration required.

Images built in layers

A Docker container is started from a so-called Docker image which, simply put, is a snapshot of a self-contained file system. To build an image, you usually create a text file called a Dockerfile and start by choosing an existing image to base it on. This can, for example, be a certain flavour of Linux that provides the necessary libraries, or a development kit that your application needs. You then define instructions in the Dockerfile to install dependencies and your application code, and each instruction creates its own immutable image layer. In the way it builds these layers, Docker is not very different from a version control system like Git: each layer is defined by the file difference to the preceding image, its size is determined by that difference alone, and each layer is uniquely identified by an ID.
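You can inspect these layers for any local image with docker history, which lists one row per layer: the instruction that created it and the size of its file difference (centos:7 below is just an example image).

docker history centos:7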

When you start a container from an image, a read/write container layer is created on top of it. This layer is discarded when the container is removed, unless you commit the changes as a new image layer.
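As a quick illustration of that read/write layer (the image and file names here are arbitrary):

# Start a container and create a file in its writable container layer
docker run --name demo centos:7 touch /opt/hello.txt
# Preserve the change as a new image layer before the container is removed
docker commit demo my-centos:with-hello
# Removing the container discards its writable layer
docker rm demo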

Core design patterns

To get straight to the point: AEM isn't exactly the best kind of application to put in a Docker container. According to its homepage, Docker can "Build, Ship, and Run Any App, Anywhere". That promise shouldn't be trusted blindly, though. At its core, Docker is best suited for stateless microservices, not for stateful monoliths like AEM.

Microservice vs Monolith

AEM, with its modular design based on OSGi, is technically a collection of microservices, but seen as a self-contained application it is what you would call a monolith. Docker containers show their full potential when used with lightweight services that instantiate quickly. Only then can you properly benefit from the fast start-up times of containers compared to virtual machines, which need to boot an entire OS first. AEM, however, usually takes minutes to start anyway, and a fresh installation without any content already uses over 540 MB of disk space before the first boot and 1.9 GB after the first start (AEM 6.3). AEM in Docker can therefore profit neither from the light footprint nor from the scalability through fast instantiation that containers offer over virtual machines.

Stateless vs Stateful

Everything is content. A core notion so important that it's probably the first thing every novice AEM developer gets taught. AEM as an application is tightly integrated with its underlying content repository, which in turn makes the application itself content. And that is exactly the reason why it is problematic to put AEM in a Docker container.

Unlike a virtual machine, a container can't be removed without destroying all files the application created inside it. Of course, the container's state could be committed as a new image layer, but the whole idea behind keeping containers free of persistent state is that deploying a new version is as easy as starting a new instance and deleting the old one, which in the case of AEM would also delete all the added content along with it.

There's another way to persist data in Docker: data volumes. They are initialised when a container is created and are independent of the container's lifecycle, i.e. they persist even if the container is deleted. This also makes data volumes shareable and reusable among containers.
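A minimal example of a named data volume (the volume name, mount path and image name are placeholders):

# Create a named volume and mount it at the repository path of a hypothetical AEM image
docker volume create aem-repo
docker run -d --name author -v aem-repo:/opt/cq/crx-quickstart some-aem-author-image
# Removing the container does not remove the volume; a new container can reuse it
docker rm -f author
docker run -d --name author -v aem-repo:/opt/cq/crx-quickstart some-aem-author-image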

However, this doesn't exactly solve our problem with AEM. If we move the repository to a persistent volume, we no longer lose our content, but it defeats the purpose of what Docker is meant for, since old application code would be persisted in the volume as well.

Let's put AEM in a container anyway

Let's put aside for a moment that our preconditions for utilising Docker across the whole software lifecycle aren't optimal. How easy is it to get AEM to run inside Docker in the first place? Would it really give us an advantage over proven setups with a virtualisation solution like Vagrant? To evaluate that, let's look at two equivalent, very basic setups where we start an unconfigured AEM author via the quickstart jar. For both options we'll perform the following steps:

1. Install Java
2. Copy the quickstart jar and the license file to the machine
3. Unpack the quickstart jar
4. Start the AEM author instance

(If you want to try the following code snippets yourself, please make sure to place the quickstart jar along with the license file in the same directory as your Vagrantfile and Dockerfile. For a more sophisticated setup with images for author, publish and load balancer instances, I recommend having a look at some of the public images on Docker Hub.)

Vagrant

First, let's see what a basic Vagrant setup could look like. We need to define an OS for the virtual machine and allocate system resources. We then execute a bootstrap shell script to perform the steps mentioned above (the box and resource values below are illustrative).

Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  # CentOS 7 box, to keep the setup comparable to the Docker example below
  config.vm.box = "centos/7"
  # Make the author instance reachable from the host
  config.vm.network "forwarded_port", guest: 4502, host: 4502
  # Allocate system resources (values are illustrative)
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "2048"
  end
  # Provision the VM with the bootstrap script shown below
  config.vm.provision "shell", path: "bootstrap.sh"
end

bootstrap.sh:

#!/bin/sh

# Install Java
sudo yum -y update
sudo yum -y install java-1.8.0-openjdk

# Copy the quickstart jar and license file from the shared folder
mkdir /opt/cq
cp /vagrant/cq-author-4502.jar /opt/cq/
cp /vagrant/license.properties /opt/cq/

# Unpack AEM and start the author instance
cd /opt/cq
java -XX:MaxPermSize=256m -Xmx1024M -jar cq-author-4502.jar -unpack -r nosamplecontent
crx-quickstart/bin/quickstart

To create our VM and start AEM, we simply run the command vagrant up in the directory where our Vagrantfile resides.

Docker

Now let's do the same in Docker. Since containers don't simulate a machine but share the host's kernel at the process level, we don't need to allocate system resources: the container simply takes as many resources as it needs, and at most what the host's kernel scheduler allows. Technically, we could have skipped the Java installation step by choosing the official OpenJDK Docker image as a base. But to make the two examples a bit more comparable, we will use the CentOS 7 image as a base instead and install Java manually.
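(As an aside: if you ever do want to cap a container's resources, the Docker CLI provides flags for that; the values and image name below are only examples.)

docker run -d --memory 2g --cpus 2 some-image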

Dockerfile:

FROM centos:7
MAINTAINER martinloeffler

# Install Java
RUN yum -y update
RUN yum -y install java-1.8.0-openjdk

# Copy the required build media
ADD cq-author-4502.jar /opt/cq/cq-author-4502.jar
ADD license.properties /opt/cq/license.properties

# Unpack AEM
WORKDIR /opt/cq
RUN java -XX:MaxPermSize=256m -Xmx1024M -jar cq-author-4502.jar -unpack -r nosamplecontent

# Expose the author port and start the instance when the container is run
EXPOSE 4502
CMD crx-quickstart/bin/quickstart

To build the Docker image and add it to the local image registry, we execute the following command in the folder where the Dockerfile resides:

docker build -t aem/author:6.3 .

-t aem/author:6.3 gives our image a name and tag. The following command is then used to start a container from the newly created image:

docker run --name AEM_AUTHOR_6.3 -p 4502:4502 -d aem/author:6.3

With --name AEM_AUTHOR_6.3 we define a name for our container, -p 4502:4502 maps the port AEM uses inside the container to a port on the host machine, aem/author:6.3 tells Docker to start the container from version 6.3 of our aem/author image, and -d runs the container in detached mode, so it keeps running in the background instead of staying attached to the terminal.
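To check that the container is up and to follow AEM's start-up, the usual Docker commands apply:

docker ps                      # the container should be listed as running
docker logs -f AEM_AUTHOR_6.3  # follow the quickstart output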

Comparison

To measure how long each setup takes until AEM is fully started, I used a small shell script that instantiates the environment and then polls the Felix web console until all bundles report as active. A sketch of the Docker variant is shown below; the console endpoint and the default admin credentials are assumptions of this sketch.

instantiation_time_docker.sh:

#!/bin/sh

timestamp() {
  date +%s
}

getBundleStatus() {
  # Read the bundle status line from the Felix web console (default admin credentials assumed)
  curl -s -u admin:admin http://localhost:4502/system/console/bundles.json \
    | sed -E 's/.*"status":"([^"]+)".*/\1/'
}

instantiate() {
  docker run --name AEM_AUTHOR_6.3 -p 4502:4502 -d aem/author:6.3
}

timekeeping() {
  start=$(timestamp)
  until [ "$(getBundleStatus)" = "Bundle information: 519 bundles in total - all 519 bundles active." ]; do
    sleep 5
  done
  echo "Instantiation took $(( $(timestamp) - start )) seconds"
}

for i in {1..5}
do
  instantiate
  timekeeping
  docker rm -f AEM_AUTHOR_6.3
done

For the Vagrant setup, I used a slightly modified version of this script with the respective commands to start the VM. To get an average start-up time, I took the measurements five times for each setup. The tests showed that Docker had an average instantiation time of 4 minutes, while Vagrant took more than double that with around 8 minutes and 40 seconds. This huge difference is mainly due to two reasons: the virtual machine first has to boot an entire guest OS before AEM can even be launched, and the provisioning steps (updating packages, installing Java, unpacking the quickstart jar) run when the VM is created, whereas in the Docker setup they are already baked into the image at build time.

We can go a step further and speed up the Docker instantiation even more. When booting AEM, it makes a big difference whether an instance is started for the first time or has been started at least once before. Since an image is just a snapshot of a file system, we can start the AEM instance right after unpacking the quickstart jar, during the creation of the Docker image. To do that, I used the same method as for the time measurements to determine when the instance has started and the image build can complete:

startAEM.sh:

#!/bin/sh

getBundleStatus() {
  # Read the bundle status from the Felix web console (default admin credentials assumed)
  curl -s -u admin:admin http://localhost:4502/system/console/bundles.json \
    | sed -E 's/.*"status":"([^"]+)".*/\1/'
}

# Start AEM in the background and wait until all bundles are active
/opt/cq/crx-quickstart/bin/quickstart &

while [ "$(getBundleStatus)" != "Bundle information: 519 bundles in total - all 519 bundles active." ]; do
  sleep 10
done

echo "$(date +"%T"): STARTED!"

The script is called by adding it to the image (e.g. with an ADD instruction) and invoking it with a RUN instruction right after the line in which we unpack the jar. (This solution is, of course, not very elegant, but it serves its purpose for the sake of the argument.) After making this small addition, the Docker container had an average instantiation time of 53 seconds.

As we can see, Docker really does give us a significant improvement when it comes to starting an AEM instance. However, this time saving comes at a price: looking at the image sizes, the uninitiated image alone has a total size of 1.7 GB.
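You can check this yourself; docker images reports the size of every image in the local registry, filtered here by our repository name:

docker images aem/author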

In our simple example, we only have a single author instance. When you think of a complete setup with an author, a publish and a load balancer, you easily exceed the 10GB mark and that doesn't even include any additional application code or generated content.

Use cases

Ok, so AEM is heavyweight and it's not possible to separate the application data completely from the content. But as we have seen, we get significant improvements with regards to instantiation time. Let's have a look at some different use cases and evaluate where we can benefit from Docker or when the stateless nature of containers is a showstopper.

Configuration Management

The example above only installs a very basic AEM instance, but real-world applications need a lot more server configuration. To manage these configurations and to avoid snowflake servers, there are several ways to automate the provisioning process. In the Vagrant example we used a shell script as a provisioner, but for more complex tasks you could choose a provisioning tool like Puppet, Chef or Ansible, to name three of the most popular ones. Docker works a bit differently in that regard, since the provisioning is technically done directly in the Dockerfile via shell commands.

In some cases, and I believe AEM is one of them, combining Docker with a dedicated provisioner is the better option. Thankfully, all of the mentioned tools can also be used to configure Docker containers. So when it comes to Configuration Management, Docker is at least on par with virtualisation tools like Vagrant.


Production

As mentioned earlier, the great benefit of using Docker in production would be the ability to tear down a container and replace it with a new application version with no significant configuration effort and zero downtime. But even with the data persisted on a Docker volume, you would still have to deploy into that live volume to update the application code stored there. That's the main reason why AEM, in its current implementation, shouldn't be run in Docker for production environments.

Dev Environment

For a developer's local environment, live deployment into a volume is of course less risky, so Docker could be used to streamline the development process a bit. However, I don't consider the benefit to be significant.

From personal experience, I can say that while Docker doesn't give me much benefit over the setups I'm usually working with, I do enjoy having a disposable instance at hand that I can use, for example, to test service packs or different Java versions.

Continuous Integration

There is, however, a use case where we don't need to persist data in AEM and where instances are so short-lived that cleaning up created content might even be a benefit: in Continuous Integration pipelines, Docker can be used to run tests on feature branches. Here, Docker is surprisingly efficient for AEM, since we can greatly benefit from the relatively fast instantiation and from the incremental layer structure of a Docker image. You could, for example, set up a base Docker image with the code from the develop branch of your project. This initial image would be quite large, since it holds the entire application. But you could then start a container for each branch that goes off develop and deploy only the code changes to the respective AEM instance running inside that container.

As mentioned above, image layers are read-only, which is why all feature branches can share the same base image. The size of each container is therefore roughly the size of the code changes in its branch. The branch-specific containers can then be used to run tests and are easily cleaned up by simply stopping and removing them.
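A sketch of what such a pipeline step could look like (the image tag, package path, Maven profile and credentials are placeholders; the package is deployed via AEM's standard package manager HTTP interface):

# Start a disposable author for this feature branch from the shared develop image
# (parallel branches would each need their own host port)
docker run -d --name author-$BRANCH_NAME -p 4502:4502 aem/author:develop

# Once the instance is up (see the bundle-status check above), deploy only this branch's package
curl -u admin:admin -F file=@target/my-project-$BRANCH_NAME.zip -F force=true -F install=true \
  http://localhost:4502/crx/packmgr/service.jsp

# Run the tests against the running instance, then throw the container away
mvn verify -Ptest-against-local-author
docker rm -f author-$BRANCH_NAME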

Conclusion

So, are AEM and Docker a good match? At Netcentric, we have successfully been using Docker for testing feature branches in some of our Continuous Delivery Pipelines for quite a while now. This is where Docker excels together with AEM today. For use-cases where the content of an instance mustn't be discarded (most notably production environments), I wouldn't recommend switching to Docker at the moment.

AEM in its current form just isn't meant to be put into a stateless container. However, there is a big emphasis on "current form": if future versions of AEM make it possible to use multiple content repositories, we could separate generated content from application content. That way, the stateless application part could run in the container while the generated content resides safely in a Docker volume. Until that happens, I recommend sticking to proven solutions for production and development servers.