How (not) to break AEM

Anyone who works with AEM for long enough will agree on some key truths. These include “we shouldn’t have more than 1000 nodes on the same level” and “don’t commit big changesets in one transaction”, alongside many more. To be truly efficient developers, we need to distinguish between what’s true and what’s just a myth or an outdated statement. At AdaptTo 2018, Georg Henzler, a “myth buster” Solution Architect at Netcentric, gave a talk on 7 Ways to Break AEM. In this talk, he tried to confirm or bust these myths and, in doing so, to test how robust AEM is.

How do our breaking tests work?

To automate the testing process and make it easier, Georg Henzler developed Breaklet. Breaklet is a Groovy script which executes tests iteratively, takes care of the lifecycle handling and provides us with a bunch of useful methods. It’s used in every test and, as will be shown later in this article, it’s easily extendable.
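To make the idea of an iterative, step-timing test loop more concrete, here is a minimal sketch in plain Java with no AEM dependencies. It is not Breaklet’s actual code or API: the class name StepTimingSketch, the step() and run() methods, the CSV layout and the placeholder workload are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntConsumer;

// Hypothetical sketch, NOT the real Breaklet API: runs named steps once per
// iteration and records how long each step took.
class StepTimingSketch {

    // LinkedHashMap keeps the steps in the order they were registered.
    private final Map<String, IntConsumer> steps = new LinkedHashMap<>();

    // Registers a step; the action receives the 1-based iteration number,
    // which a real test could use to scale its workload.
    StepTimingSketch step(String name, IntConsumer action) {
        steps.put(name, action);
        return this;
    }

    // Runs every step for iterations 1..iterations and returns CSV lines:
    // a header row, then one row per iteration with elapsed milliseconds.
    List<String> run(int iterations) {
        List<String> csv = new ArrayList<>();
        csv.add("iteration," + String.join(",", steps.keySet()));
        for (int i = 1; i <= iterations; i++) {
            StringBuilder row = new StringBuilder().append(i);
            for (IntConsumer action : steps.values()) {
                long start = System.nanoTime();
                action.accept(i);
                row.append(',').append((System.nanoTime() - start) / 1_000_000);
            }
            csv.add(row.toString());
        }
        return csv;
    }

    // Placeholder workload standing in for real repository operations.
    private static void buildList(int size) {
        List<Integer> list = new ArrayList<>(size);
        for (int n = 0; n < size; n++) {
            list.add(n);
        }
    }

    public static void main(String[] args) {
        List<String> csv = new StepTimingSketch()
                .step("create", i -> buildList(100_000 * i))
                .step("traverse", i -> buildList(10_000 * i))
                .run(3);
        csv.forEach(System.out::println);
    }
}
```

Keeping the steps as named callbacks means each scenario only has to describe what to do per iteration, while the loop, the timing and the result format stay in one place.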

Before we get to the breaking tests and their results, here is the environment that was used:

AEM 6.4 SP1 / OAK 1.8.3
4GB JVM Heap (JDK 1.8)
macOS High Sierra 10.13.6
2.6 GHz Intel Core i7
32 GB 2400 MHz DDR4

The following tools were used to generate test content and to gather information from AEM and the JVM:

Netcentric AC Tool for groups and permissions
Groovy Console to create/run test setups
Headless Chromium
VisualVM
JMeter
Threads Health Check
Breaklets

7 Ways to break AEM

Maximum child nodes

This is probably the most common “myth”: that you should avoid having more than 1000 nodes on the same level, otherwise performance will be compromised. Even Adobe’s Performance Optimisation guide mentions this:

“The way a content repository is structured can impact performance as well. For best performance, the number of child nodes attached to individual nodes in a content repository should not exceed 1,000 (as a general rule).”

So it’s not surprising that this is usually the first thing to be checked. We started out with the nt:unstructured node type, whose child nodes are orderable. The scenario is as follows:

Create 1000 nodes
Save them
Reorder 1000 nodes
Save order
Traverse nodes
Query 10% of the nodes
Show all current nodes in CRXDE

We ran the Breaklet, which stores its results in a file named breaklet-result-TIMESTAMP.csv in the AEM folder. Based on the CSV file generated by the Breaklet, we could then build the following chart:

As you can see, CRXDE performance degraded really fast. You wouldn’t want to experience such response times in real life. With CRXDE included, however, the chart makes it impossible to see how the repository itself performs when creating, saving and otherwise handling nodes. So let’s exclude CRXDE from the chart:

In this case, time increases linearly and the results look fairly good. Following this, we did the same with the oak:Unstructured node type (not orderable), excluding the order-related operations of course:

As you would expect, operations on unorderable nodes performed much better, so it’s best to use them when possible.

But let’s get back to our myth. Is the “no more than 1000 child nodes” rule valid? The answer is more likely yes than no. CRXDE performs poorly with a large number of child nodes, and since developers and ops often use it for debugging, we need it to be fast. All other operations perform quite well, and it’s hard to tell the difference between 1000 and 5000 child nodes. So we can conclude that 1000 is somewhat of a magic number, and you should still check for yourself to determine the “max child nodes” limit for your system.
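One way to “check for yourself” is to raise the node count step by step and stop at the first size where an operation exceeds a response time you consider acceptable. The sketch below shows that loop in plain Java. The measureMillis function is a placeholder for a real measurement (for example, timing how long CRXDE takes to list the children), and the class name, method name, step size and budget are assumptions for illustration, not part of Breaklet.

```java
import java.util.function.IntToLongFunction;

// Hypothetical sketch: finds the largest tested child-node count whose
// measured time stays within a response-time budget.
class ChildNodeLimitSketch {

    // Tests stepSize, 2*stepSize, ... up to maxCount and returns the last
    // count that stayed within budgetMillis (0 if even the first step failed).
    static int largestCountWithinBudget(IntToLongFunction measureMillis,
                                        int stepSize, int maxCount, long budgetMillis) {
        if (stepSize <= 0) {
            throw new IllegalArgumentException("stepSize must be positive");
        }
        int lastGood = 0;
        for (int count = stepSize; count <= maxCount; count += stepSize) {
            if (measureMillis.applyAsLong(count) > budgetMillis) {
                break; // budget exceeded, stop increasing the load
            }
            lastGood = count;
        }
        return lastGood;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a real measurement: pretend 2 ms per node.
        IntToLongFunction fakeMeasurement = count -> 2L * count;
        // With a 5000 ms budget, 2000 nodes (4000 ms) is the last size that fits.
        System.out.println(largestCountWithinBudget(fakeMeasurement, 1000, 10_000, 5_000)); // prints 2000
    }
}
```

In a real run, the budget would be whatever response time your developers and ops team still find workable for the slowest tool they rely on.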

Changeset size in a single commit

Okay, here’s another real-life problem. Many projects involve migrating a large amount of data from an old system, or expect a large number of assets to be uploaded at once. In both cases we create a lot (sometimes more than a lot) of new nodes. To check this, we have another Breaklet with the following scenario:

Create a root node /content/nodetest
Create (1000 * iteration) nodes underneath
Save session
Wait for all change listener events
Delete root node
Save session
Wait for delete event

As we can clearly see, node creation (and not saving!) is the slowest operation here. So there is nothing to gain from splitting one operation that saves 5000 nodes into 5 operations that save 1000 nodes each. When running this Breaklet, we must remember that its operations are resource-intensive, so AEM should be given enough memory; otherwise an OutOfMemoryError will be thrown.
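If you want to verify this on your own system, the comparison boils down to running the same creation loop with different save intervals and timing each run. Below is a minimal sketch of such a loop. NodeSink is a hypothetical stand-in for the small part of a JCR session that the test needs (in a real test it would wrap javax.jcr.Node.addNode and javax.jcr.Session.save), and CountingSink exists only so that the example runs without a repository.

```java
// Hypothetical sketch: the same creation loop with a configurable save interval.
class BatchedSaveSketch {

    // Assumed abstraction over "add a node" and "persist pending changes".
    interface NodeSink {
        void addNode(String name);
        void save();
    }

    // Creates totalNodes nodes, saving after every saveInterval nodes and once
    // more for any remainder. Returns the number of save() calls made.
    static int createNodes(NodeSink sink, int totalNodes, int saveInterval) {
        if (saveInterval <= 0) {
            throw new IllegalArgumentException("saveInterval must be positive");
        }
        int saves = 0;
        int pending = 0;
        for (int i = 0; i < totalNodes; i++) {
            sink.addNode("node-" + i);
            pending++;
            if (pending == saveInterval) {
                sink.save();
                saves++;
                pending = 0;
            }
        }
        if (pending > 0) {
            sink.save();
            saves++;
        }
        return saves;
    }

    // In-memory sink that only counts calls; it does not touch a repository.
    static class CountingSink implements NodeSink {
        int added;
        int saved;

        @Override
        public void addNode(String name) {
            added++;
        }

        @Override
        public void save() {
            saved++;
        }
    }

    public static void main(String[] args) {
        CountingSink single = new CountingSink();
        CountingSink batched = new CountingSink();
        System.out.println("saves with interval 5000: " + createNodes(single, 5000, 5000));
        System.out.println("saves with interval 1000: " + createNodes(batched, 5000, 1000));
    }
}
```

Wrapping each createNodes call in a timer and pointing NodeSink at a real session would let you compare the two variants directly.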

Changeset size in DocumentStore

When working with DocumentStore, there is an important thing to keep in mind about the number of changes in a single commit. The transient space (which holds all changes of a single transaction) is kept in heap space, so all operations on it (including save()) are fast. But when this transient space gets too large, OAK moves it to storage, which brings a performance penalty. The number of changes at which the transient space is written to storage can be configured via OSGi.

Maximum number of components on the page

Another very interesting aspect of working with AEM is how the number of components on a page affects the page’s usability for content editors. Pages are where most of the content is created, so it’s very important to know AEM’s limits here.

The scenario in this Breaklet is as follows:

Create 50 text component nodes in a responsive grid
Save them
Open page in edit mode (browser)
Edit the last component on the page
Save

Before running the tests, we made sure that the property “sling.max.calls” in the configuration “org.apache.sling.engine.impl.SlingMainServlet” was large enough to avoid an error caused by too many calls within a single request. After that, we were ready to collect data and build our performance chart:

So, we can edit a component on a page with 50 components (quite close to real life) in around 15 seconds, and on a page with 2000 components in less than 1 minute. These are fairly good results, though they depend mostly on the performance of the browser and the client-side hardware, so they may vary a lot. However, now we have a way to quickly run the test on any system, right?

More ways to break AEM

So far, we’ve partially reviewed only 3 ways to break AEM. We strongly encourage checking out Georg’s talk at AdaptTo 2018 to discover more ways to break AEM (e.g. how the number of user groups affects performance and how many Resource type inheritance levels to use) as well as to access more detailed information about the processes listed above. Next, we will look into ideas not included in the talk that remain valid for anybody working with AEM.

Number of configurations until browser timeouts

After Georg’s talk, Jörg Hoh stated on Twitter that it would be interesting to know how many OSGi configs we can have before the browser ends up with a timeout. We like challenges, so we accepted this one.

The test was performed on the following system:

Vanilla AEM 6.4 with OAK 1.8.2
4GB JVM Heap (JDK 1.8)
macOS High Sierra 10.13.6
2.8 GHz Intel Core i7
16 GB 1600 MHz DDR3
Chromium 70.0.3508.0 in headless mode for tests

To conduct this test, we wrote our next Breaklet. All steps below were performed iteratively:

Create 1000 OSGi configs for an active runmode
Save them
Wait until they are processed by OSGi Installer
Open Configuration Manager page
Open last created configuration

As different browsers have different predefined page load timeouts, we configured the Breaklet to fail on any DOM load time longer than 120 seconds. Below you can see the results:

As you can see, the DOM content load time increases linearly, and the Breaklet failed at around 19,000 OSGi configs. That is far more than a regular project will ever have, so we can be sure that the Configuration Manager console will load fast enough with any reasonable number of configurations.
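Because the growth is linear, a handful of measurements is enough to estimate where a given timeout would be hit without running the test all the way to failure. The following sketch fits a least-squares line through (config count, load time) pairs and solves it for a limit such as the 120 seconds used above. The class and method names are hypothetical, and the numbers in main are synthetic placeholders, not the measurements behind the chart.

```java
// Hypothetical sketch: least-squares line fit plus extrapolation to a limit.
class LinearExtrapolationSketch {

    // Ordinary least-squares fit of y = slope * x + intercept.
    // Returns a two-element array {slope, intercept}.
    static double[] fitLine(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = (n * sumXY - sumX * sumY) / denominator;
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {slope, intercept};
    }

    // Solves slope * x + intercept = limit for x.
    static double xAtLimit(double[] line, double limit) {
        if (line[0] <= 0) {
            throw new IllegalArgumentException("slope must be positive to reach the limit");
        }
        return (limit - line[1]) / line[0];
    }

    public static void main(String[] args) {
        // Synthetic placeholder data: number of configs and load time in seconds.
        double[] configs = {1000, 2000, 3000};
        double[] seconds = {5, 10, 15};
        double[] line = fitLine(configs, seconds);
        System.out.println("estimated configs at 120 s: " + xAtLimit(line, 120));
    }
}
```

An estimate like this only holds while the relationship stays linear, so it is a planning aid rather than a replacement for running the test.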

Homework (number of workflows to break AEM)

One of the “homework” tasks that Georg gave to the audience was “maximum number of workflows”. Back in the AEM 5.6.1 days, I ran into an issue where a deployment to the QA instance was taking one hour instead of the regular 5-10 minutes. A lot of time was spent on the investigation, but no cause was found. So we decided to do a general system cleanup, and the first step was to clean up active and completed workflow instances. As soon as that was done, we re-ran the deployment and, to our surprise, it passed really fast. Since then, I always keep in mind how important workflow cleanup is.

Workflows can impact systems in different ways, from indexing issues to a general system slowdown. But let’s keep it simple and just check how long it takes to create a new workflow and how the Inbox page handles a large number of messages (produced by workflows).

So the scenario is as follows:

Create 1000 “Request for Activation” workflow instances
Wait until they all are started
Open Inbox page
Wait for inbox items to be loaded

If we run the Breaklet which implements the scenario above, we get the following results:

This demonstrates that the time needed to start a workflow programmatically doesn’t depend on the number of existing active workflows (except for a couple of relatively minor spikes). The time to load the Inbox page in Touch UI doesn’t depend on the number of workflows either. This is different from the Inbox in Classic UI, where a JSON with all the inbox messages is loaded (we checked this on vanilla AEM 6.4). Even though we weren’t able to break AEM with this simple scenario, remember to clean up workflows regularly in order to prevent big problems with general AEM performance.

Conclusion

A ‘good developer’ believes in facts and doesn’t blindly trust myths. In this article, we’ve brought together a fun exercise in breaking AEM, an initiative to dispel the myths around it (such as the max changeset size for TarMK) and a greater understanding of the limits of the system. By doing that, we’ve identified areas for caution and, on the other hand, areas where there’s no need to limit ourselves. Hopefully this work brings us one step closer to being a ‘good developer’.