I was recently asked by a customer to deliver an analysis of available open source solutions as a replacement for a decade-old proprietary enterprise search solution. In this article, I want to share the outcome of this analysis and give an outlook on how we perceive the future of such solutions. As a disclaimer, be aware that I am the CEO and cofounder of France Labs, the company developing Datafari. This matters because Datafari was part of the solutions analyzed, which obviously introduces a potential bias into my considerations.
As an introduction, let me define what we mean by Enterprise Search in this particular context: it is about proposing a solution that can index many document sources within an information system, without prior knowledge of the working context (and yes, this is bad for UI optimization and relevancy), that allows employees to type textual queries into a search bar, and that displays the search results as a vertical list together with facets. In addition, this must be done securely, which means it should come with connectivity to AD/LDAP for authentication, and with the capacity to respect document-level permissions when users are searching (i.e. only display what users are allowed to see). Orthogonally to these functional aspects, Enterprise Search solutions must also provide administration capabilities, in terms of configuration and operations.
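To make the permission-trimming requirement concrete, here is a minimal, hypothetical sketch of the common "early binding" approach: documents are indexed along with the identities allowed to read them, and each query is augmented with a filter built from the user's AD/LDAP groups. The field name `allow_tokens` and the Lucene-style syntax are illustrative assumptions, not taken from any specific product.

```python
# Hypothetical sketch of early-binding security trimming: the search
# index stores, for each document, the users and groups allowed to read
# it (here in a field named "allow_tokens"), and every query is
# restricted to the current user's identities via a filter clause.

def build_acl_filter(user, groups, acl_field="allow_tokens"):
    """Build a Lucene-style filter restricting results to documents
    readable by `user` or any of their AD/LDAP `groups`."""
    tokens = [user] + list(groups)
    # A real implementation must escape/sanitize tokens first.
    clause = " OR ".join(f'{acl_field}:"{t}"' for t in tokens)
    return f"({clause})"

# The engine would apply this as a mandatory filter on every search:
print(build_acl_filter("alice", ["grp_sales", "grp_hr"]))
# → (allow_tokens:"alice" OR allow_tokens:"grp_sales" OR allow_tokens:"grp_hr")
```

The alternative "late binding" approach checks permissions against the source repositories at query time, which is always up to date but much slower at scale.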
This means that, technically, I was on the lookout for the following components: crawling frameworks (to connect to document sources and prepare the documents for the indexing phase), indexing components (to analyze the contents and optimize the search index), and search components (to rapidly identify “relevant” documents and display them to the user). As orthogonal requirements, these components had to be manageable and secure.
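As a toy illustration of how these three components fit together (this is not any product's actual code, just a sketch of the pipeline): a "crawler" yields documents, an "indexer" builds an inverted index, and a "searcher" ranks documents by term overlap.

```python
from collections import defaultdict

# Toy end-to-end pipeline for the three components: crawl, index, search.
# Real solutions add content extraction, text analysis, relevancy
# scoring, facets, and security trimming on top of this skeleton.

def crawl(sources):
    # Stand-in for a real crawler: the "sources" are an in-memory dict.
    for doc_id, text in sources.items():
        yield doc_id, text

def build_index(docs):
    # Inverted index: term -> set of document ids containing it.
    inverted = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted

def query(inverted, q):
    # Rank documents by how many query terms they contain.
    scores = defaultdict(int)
    for term in q.lower().split():
        for doc_id in inverted.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

sources = {"a": "enterprise search basics", "b": "search index tuning"}
idx = build_index(crawl(sources))
print(query(idx, "search index"))  # → ['b', 'a']
```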
Due to time constraints, I only analyzed the following open source technologies: Apache Solr, OpenSearchServer, OpenSearch, Datafari and Apache ManifoldCF. These choices were made because, first, open source enterprise search solutions containing all of the expected components are quite rare, and second, because I wanted to analyze at least one open source solution per component.
Jumping straight to the conclusion, here is a summary of my findings:
- Apache Solr focuses on the indexing and search aspects. It lacks a search UI (although you can connect it to Blacklight), it lacks crawlers, and it does not natively handle document-level permissions. So it is more of a building block than an enterprise search solution.
- OpenSearchServer was meant to be an enterprise search solution. Notice that I am using the past tense: it has not been updated for several years. Plus, there is no documentation with regard to document-level permissions.
- OpenSearch is the new kid on the block. It is a fork of the ELK stack (which gave up on open source licensing in 2021). It is very active on ELK’s specialty, which is not document search but log analytics. Being very recent, it remains to be seen whether the Elastic community will shift to OpenSearch or remain faithful to Elastic. Lastly, it lacks crawlers beyond log retrieval.
- Datafari claims to be an enterprise search solution. It has been available as open source since 2015 and covers the full spectrum, aside from the security aspects. Actually, the security aspects do exist, but only in the Enterprise (proprietary) Edition of Datafari, not the open source one.
- Apache ManifoldCF is – to my knowledge – the only decent open source framework dedicated to crawling. As a consequence of its charter, it lacks the search user interface, indexing and search aspects. It does provide the ability to fetch document-level permissions and to connect to AD/LDAP. Plus, it is well documented on how to create new connectors.
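Because Solr and ManifoldCF are building blocks rather than complete solutions, assembling them means writing the glue code yourself. As a hedged illustration (the host, core name and field names are hypothetical), here is the kind of Solr `/select` request URL a custom search UI would have to construct on its own:

```python
from urllib.parse import urlencode

# Sketch of using Solr as a building block: with no bundled search UI,
# the application layer must construct queries itself. This builds a
# standard /select request URL; the base URL, core name ("docs") and
# ACL field are hypothetical examples.

def solr_select_url(base, core, q, fq=None, rows=10):
    params = [("q", q), ("rows", rows), ("wt", "json")]
    if fq:
        # e.g. a security-trimming filter built from the user's groups
        params.append(("fq", fq))
    return f"{base}/solr/{core}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983", "docs",
                      "title:report", fq='allow_tokens:"grp_sales"')
print(url)
```

In a real deployment this URL would be sent over HTTP from a backend that has already authenticated the user against AD/LDAP, never directly from the browser.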
The description above is very short due to restrictions for this article, but it shows that no single open source solution covers the full spectrum. The options available then come down to the classical “build or buy” decision. One can stay fully open source and launch a project to assemble the different building blocks (for instance ManifoldCF and OpenSearch) and add the missing parts (UI and administration, to follow up on the same example). Alternatively, one can go for Datafari Enterprise Edition to enable the missing parts (security, for instance), at the cost of losing part of the open source aspects. Not to mention that since an Enterprise Search tool is there to stay for many years, and needs to remain secure, should the “build” option be taken, extra care must be given to securing enough yearly budget to maintain and update the software. Between the lines, you can guess that at France Labs, we have made the bet that it is far cheaper for an organization to pay for a yearly Datafari license than to set up an internal devops team. Why are we not proposing a fully open source stack?
This question opens another topic, related to the investment strategy and financial stability of open source vendors. Allow me to go back to the origins of my company: when we founded France Labs more than ten years ago, Datafari was not even a thing. We wanted to develop a search algorithm that would revolutionize the existing ecosystem. To achieve this, we needed to run trials “in vivo” on actual company data, with a real search engine. The problem was that, at the time, besides two small enterprise search projects lacking documentation and stability, there was no valid open source option. This is where Datafari originates from: providing a valid engine to assess the quality of our algorithm. And what was meant to be a means to an end became a product in its own right, as we realized it had become a valid open source alternative to proprietary solutions (note that the R&D on the algorithm is still a work in progress; Datafari currently uses the classical BM25 approach). This led us to 2015, with the release of the first version of Datafari.
This first version was fully open source, and the issue was that external consultants started proposing it to their customers without partnering with us, thus killing the revenue stream we needed to fund the necessary R&D and maintenance costs. Opting for a “freemium” or “open core” approach allowed us to combine both worlds: we still provide an end-to-end open source solution satisfying many scenarios, but it lacks additional components, which motivates enough customers to pay for the “extra mile” (the proprietary Enterprise Edition of Datafari), thus financing the future development of the product.
From an investment strategy perspective, based on our experience, we can conclude that, unfortunately, and against all the experts’ recommendations (for instance from the members of the Search Network), customers don’t want to put budget or resources into regular optimization, and they want to reduce the internal maintenance workload to a minimum. This means that our heavy investment in advanced configurability and customization was not really useful, and that more focus should be given to other topics, such as the user interfaces.
These decisions led us to today, with a very diverse portfolio of customers, ranging from the army to aerospace, and including the banking industry, universities, the nuclear sector and police forces. And obviously with a huge R&D roadmap ahead of us, to always propose a state-of-the-art solution.
To conclude this article, let me use my crystal ball for the future of Open Source Enterprise Search. Artificial Intelligence is on everyone’s mind. It is already well present in the document analysis phases: detecting objects in images and extracting entities from text, for instance, are already available (think OpenCV and OpenNLP). It is also there for interacting with customers, with chatbots (for instance with the Open Conversation Kit). But for vector search (aka neural search), we are not there yet. End-to-end implementations are just starting to pop up (in Apache Solr, for instance, one still has to manually generate the vectors for documents before injecting them into the index), but more importantly, there is still a strong limitation on document size: 512 words is the de facto standard. And while that is probably more than enough for e-commerce scenarios, it is definitely very small for Enterprise Search, with documents that can span dozens of pages. Workarounds can be found, such as splitting documents. We are not there yet, but with a thriving community progressively creating the building blocks, I would bet that something will be available in the coming two to three years.
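To illustrate the document-splitting workaround, here is a minimal sketch that breaks a long text into overlapping chunks of at most 512 words, so that each chunk could be embedded separately and indexed as its own vector. The chunk size and overlap values are illustrative assumptions, not a standard.

```python
# Sketch of the "split the documents" workaround: break a long document
# into overlapping fixed-size word windows so each piece fits a typical
# embedding model's input limit. Overlap reduces the chance that a
# relevant passage is cut in half at a chunk boundary.

def chunk_words(text, max_words=512, overlap=64):
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        chunks.append(" ".join(chunk))
        if start + max_words >= len(words):
            break
    return chunks

# A synthetic 1200-word "document" splits into three chunks:
doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(doc)
print(len(chunks), [len(c.split()) for c in chunks])
# → 3 [512, 512, 304]
```

At query time, each chunk is scored independently and results are typically aggregated back to the parent document, which is itself a non-trivial relevancy problem.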
Cedric Ulmer, CEO of France Labs