We can divide the process in three separate blocks: indexing acls (remember our good old friend the allowAcl field), here represented in green, managing an app to provide those user acls (represented in orange) and modifying the query to include user acls (represented in blue).
Indexing ACLs
Each time we index a document, we need to add the allowAcl property to it. This property will contain the list of groups and users that have read access to the document. But getting the ACLs of a document is a process highly dependent on the content store where the documents live in. If we look at Netcentrics case, the AEM environment and the JCR repository make it reasonably easy for us:
In an AEM application, changes in permissions can be performed only in a few spots, by placing some event listeners in some clever locations, we are able to keep track of changes in permissions. When a change in a permission of a document is detected, we essentially recalculate its allow and deny ACLs and perform a reindex of the document.
The minor details of the implementation of this process are very much out of the scope of this article, so let’s move on!
The filtering app
The permissions each user has will change frequently, the filtering app is in charge of keeping an updated record of the permissions each user has. To achieve this, the filtering app will periodically request a file with the ACL information of the whole system to the Permission-extractor module (part of the AEM Connector). Once the extractor gives the file to the filtering app, the file will be parsed, the groups will be expanded, and the new permissions for each user (if any) will be saved.
At this point, you might be wondering something like: the groups will be expanded... What is that mean?
Let’s say I am a member of two groups: A and B. That means → Andres is member of [ A, B ]
Ok, fair enough, but remember that groups can be members of other groups. Now let’s say that group A is member of both group C and D. Also, Group C is member of group D. Giving us the following acl lists:
Andres is member of A, B
A is member of C, D
C is member of D
As you can easily see, we have a nested structure of group memberships, and we need to expand that structure.
After a simple recursive algorithm, we have the result of the group expansion in this example:
Andres is member of A, B, C, D
You might also be thinking about that extractor thing: the extractor gives the file to the filtering app. How does it get that system ACL file in the first place?
The extractor relies on Netcentric's ACL tool to generate the file with the permissions of the system. The ACL tool is an awesome open source project that can be found here.
Modifying the query
When the user queries a text the search index triggers a query pipeline. It will have a number of steps, for example, adding default query params, authenticating the query, etc. One of those steps will be responsible for adding document-level security to the query. It’ll do this by calling the filtering app to get the permissions of the user doing the query. And then it will add the results of that call to the query, before sending the final query to the search index.
What you now know about Secure Document Search
In this article we have talked about all the major building blocks involved in Secure Document Search. What we want to achieve, why is it a complicated problem, late binding vs early binding, a general overview of architecture, and some implementation details.
As you can see, this is not an easy topic. Solutions depend heavily on project specifics. There are a lot of different types of ACLs and content stores. A good solution requires very good error handling and the whole system needs to be highly available at all times, etc.
To conclude, I just want to thank you very much for making it to the end :)
I truly hope this read wasn't too boring, and I hope it helps you understand Secure Document Search a little bit better.