Avansic: All Search Technologies Are Not Created Equal

Extract from Avansic’s article “All Search Technologies Are Not Created Equal”

Search Engines in eDiscovery

It’s helpful to understand that eDiscovery platforms have different search technologies under the hood. Historically, dtSearch was the common choice because of its affordability and ease of implementation. More recently, free and open-source search technologies have become available. The most widely used of these is Lucene, which major companies rely on both in eDiscovery and in the wider technology world.

dtSearch and Lucene

dtSearch, a constant in the market for years, is a closed-source tool whose search syntax has been widely adopted. Proximity searching with the “W/” operator is fairly well known: “pipe W/5 broken” finds any mention of the term pipe within five words of broken. dtSearch also provides a robust set of default indexing settings, including common noise words, vocabulary, and ignorable white space characters. These defaults are easily modified, but note that a complete reindex is required whenever a modification is made.
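To make the “within” operator concrete, here is a minimal Python sketch of what a query like “pipe W/5 broken” asks for. It is an illustration only, not dtSearch’s implementation: the tokenizer, the helper names tokenize and within, and the rule that distance is counted as the difference in word positions are all assumptions, and dtSearch’s own word-breaking and counting rules may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(text, first, second, distance):
    """Return True if `first` occurs within `distance` words of `second`, in either order.

    Assumption: distance is the difference between the two word positions.
    """
    tokens = tokenize(text)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )


# Hypothetical sample sentence: "pipe" is word 1 and "broken" is word 6.
sample = "The pipe in the basement was broken"
print(within(sample, "pipe", "broken", 5))
print(within(sample, "pipe", "broken", 4))
```

Under these assumptions the first call matches and the second does not, because the two terms sit exactly five word positions apart.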

In contrast, Lucene, in its purest form, does not provide a set of default noise words, vocabulary, and ignorable white space; those must be sourced elsewhere. Lucene’s native query syntax is plain text, but search servers built on Lucene, such as Elasticsearch, typically express queries in JSON, a structured data format rather than a markup language. Most implementations of Lucene-style indexes in the eDiscovery space mask this JSON search language with an interpreter that mirrors the syntax used by dtSearch, so searching looks and feels familiar to users. The primary benefits of a modern index like Lucene are lower licensing cost and advanced features such as distributing an index across multiple discrete servers. Distribution increases the management cost for an index but allows more redundancy within the searching system. For example, one index server can be rebooted while the others continue to serve searches, which is not possible with dtSearch.
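The kind of interpreter described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor’s translator: it assumes an Elasticsearch-style JSON query (the span_near query with span_term clauses), a made-up field name content, and support for only the simplest “term W/N term” form. Passing N straight through as slop is also an assumption; the two systems may count distance differently, so a real translator would need to verify that mapping.

```python
import json
import re

# Matches only the simplest form: <term> W/<N> <term>
PROXIMITY = re.compile(r"^\s*(\w+)\s+w/(\d+)\s+(\w+)\s*$", re.IGNORECASE)


def translate_proximity(query, field="content"):
    """Translate a dtSearch-style 'a W/N b' query into an Elasticsearch-style JSON body.

    `field` is a hypothetical index field name. Using N directly as `slop`
    is an assumption that should be checked against both engines.
    """
    match = PROXIMITY.match(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    left, distance, right = match.groups()
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: left.lower()}},
                    {"span_term": {field: right.lower()}},
                ],
                "slop": int(distance),
                "in_order": False,  # W/ matches the terms in either order
            }
        }
    }


print(json.dumps(translate_proximity("pipe W/5 broken"), indent=2))
```

The user keeps typing the familiar “W/” syntax, and only the translated JSON body is sent to the search server.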

