Tuning the Lucene index within AEM is a multistep process. Learn what you need to do to optimise your index.
The index is a core component of AEM and should answer any query you fire at it quickly. In many cases it also grows significantly, and it is sometimes unclear why it is growing or slow. This blog article briefly describes the steps involved in tuning it. We will also look at query analysis and traversals, and at how to mitigate traversals so that queries run fast.
Every AEM instance running in production has an index. Adobe recommends some steps to tune it further. Some are listed below, with additional information and documentation where the steps are not self-explanatory. Before trying any of this on a production instance, test it on an exact copy. This tells you how long the tasks take and lets you plan downtimes properly. If you are tuning several publish instances and run in a cloud or cloud-like infrastructure with fast snapshot capability, I recommend that you tune one publish instance, temporarily buffer replication to the others, and then clone the tuned instance. This saves a lot of time and gives you a much more uniform setup.
Details on the functionality and architecture of the index can be found here
To check whether the index is working properly, look at the following JMX MBeans:
org.apache.jackrabbit.oak: "IndexCopier support statistics" ("IndexCopierStats")
org.apache.jackrabbit.oak: "async" ("IndexStats")
org.apache.jackrabbit.oak: "Lucene Index statistics" ("LuceneIndex")
The Lucene index statistics will start showing separate indexes once the initial indexing is finished.
A traversal occurs when the query engine finds no index that is suitable for a query and walks through the repository node by node instead. This can be quite a performance killer, so it is advisable to create custom indexes when needed. A great tool for this is the index generator. Simply create an index configuration from your XPath query and deploy it.
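As a minimal sketch of what such an index configuration might look like, assume a query like `/jcr:root/content/mysite//element(*, cq:Page)[jcr:content/@cq:template = '/conf/mysite/settings/wcm/templates/article']`. The query, the path `/content/mysite` and the index name `myTemplateIndex` are hypothetical and only serve as an illustration. The definition would be deployed as `.content.xml` for a node below `/oak:index`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical node /oak:index/myTemplateIndex; names and paths are assumptions -->
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0"
          xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
          xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0"
          xmlns:cq="http://www.day.com/jcr/cq/1.0"
    jcr:primaryType="oak:QueryIndexDefinition"
    type="lucene"
    async="[async]"
    compatVersion="{Long}2"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/content/mysite]"
    reindex="{Boolean}false">
    <indexRules jcr:primaryType="nt:unstructured">
        <!-- The rule applies to the node type used in element(*, cq:Page) -->
        <cq:Page jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">
                <!-- Index the relative property the query filters on -->
                <template jcr:primaryType="nt:unstructured"
                    name="jcr:content/cq:template"
                    propertyIndex="{Boolean}true"/>
            </properties>
        </cq:Page>
    </indexRules>
</jcr:root>
```

Restricting the index with `includedPaths` keeps it small, and `evaluatePathRestrictions` lets the index evaluate the path part of the query itself. After deploying it, run the query's explain output again to confirm that the new index is picked up and the traversal is gone.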
You can set a property on an index node as follows:
This can be set on the lucene index as well as on the cqPageLucene index.
Additionally, ensure the value rep:Token is added to the declaringNodeTypes property (a multi-value string property) of the nodetype index.
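A rough sketch of the nodetype index with rep:Token added is shown here. The other entries in `declaringNodeTypes` are placeholders, not AEM's actual default list, so keep every value your instance already declares and only append rep:Token. The `reindex` flag assumes that a reindex is needed for the change to take effect, which should be verified on a copy first:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- /oak:index/nodetype: the type list is illustrative; append rep:Token to your existing values -->
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0"
          xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0"
          xmlns:rep="internal"
          xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
    jcr:primaryType="oak:QueryIndexDefinition"
    type="property"
    propertyNames="{Name}[jcr:primaryType,jcr:mixinTypes]"
    declaringNodeTypes="{Name}[rep:Authorizable,sling:VanityPath,rep:Token]"
    reindex="{Boolean}true"/>
```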
Double-check your index config, especially the aggregation rules. If they contain duplicate entries, the configuration is probably wrong. The root cause of this misconfiguration is unclear at this time, but removing the duplicates and reverting to the default behaviour reduces the index size drastically.
Below is a sample Tika config for the index which applies many exclusions:
<!-- Disable package extraction as it's too resource-intensive -->
<!-- Disable image extraction as there's no text to be found -->
<!-- Disable Office -->
<!-- Media -->
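As a sketch of how such exclusions are commonly expressed, the following Tika configuration maps MIME types to Tika's `EmptyParser`, which extracts no text. The MIME types listed are an illustrative selection, not the complete list from the original config. It assumes that the file is stored as `tika/config.xml` below the Lucene index definition node:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.TypeDetector"/>
  </detectors>
  <parsers>
    <!-- Everything not listed further down is still handled by the default parsers -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- EmptyParser extracts nothing for the MIME types below -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <!-- Disable package extraction as it's too resource-intensive -->
      <mime>application/zip</mime>
      <mime>application/x-tar</mime>
      <mime>application/java-archive</mime>
      <!-- Disable image extraction as there's no text to be found -->
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <!-- Disable Office -->
      <mime>application/msword</mime>
      <mime>application/vnd.ms-excel</mime>
      <mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
      <!-- Media -->
      <mime>audio/mpeg</mime>
      <mime>video/mp4</mime>
    </parser>
  </parsers>
</properties>
```

Only exclude formats whose text your users never search for. Disabling Office extraction, for example, means that the contents of Word and Excel files no longer show up in full-text search.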
When CopyOnRead is enabled, the local copy of the index on the file system only ever grows and is never reduced. This is why the index on the file system has to be deleted, so that the tuned index can be copied to disk afresh.
Reducing the size of the Lucene index requires all of the steps above, including the adjusted Tika config. This is time-consuming. If your only goal is to reduce the size of the index on the file system of a publish instance, it may be easier to simply disable the CopyOnRead and CopyOnWrite functionality. Before you do, check with your architect whether the indexes are heavily used on the publish side.
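One way to disable both features is an OSGi configuration for Oak's Lucene index provider, sketched here as a `sling:OsgiConfig` node. It assumes the file is named `org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProviderService.xml` and is placed in a run-mode-specific folder such as `config.publish`. The folder layout is an assumption about your project, and the property names should be checked against the configuration shown in the OSGi console of your AEM version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed file name: org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProviderService.xml -->
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
          xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="sling:OsgiConfig"
    enableCopyOnReadSupport="{Boolean}false"
    enableCopyOnWriteSupport="{Boolean}false"/>
```

With both options off, index files are read directly from the repository rather than from a local copy, so queries that rely on the index may become slower. This is the trade-off to discuss with your architect.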
The following JVM parameters have shown much more stable performance on production systems. Ensure that these JVM parameters are in the AEM start script to prevent expensive queries from overloading the systems.
There are many things to look out for when operating an AEM instance. Regular maintenance tasks such as Datastore Garbage Collection, and the others detailed in a previous article, are important, but they are not the whole picture. Tar compaction is the key to reducing repository growth, and a Datastore Garbage Collection run after a compaction yields the best results in reducing the datastore size. As noted above, optimizing your Lucene index further reduces the resource footprint. It also makes sense to disable features such as CopyOnRead or CopyOnWrite where they are not needed, for example on publish instances where search is not used much.