Following up on our previous search-related post, which covered filtering and keyword searching as two ways to narrow or expand your data set, the next step is using thread and concept searching to reduce the data set. The overall result of eliminating thread duplicates and only selecting a unique set of concepts empowers you with greater insight into the performance of keyword searches and filtering activities.
De-Duplication, Threads, and Concepts
What does de-duplication have to do with any of this? How does de-dupe relate to threading and concept searching? Exact de-duplication is repeatable, reproducible, and based on well-known mathematics. Conceptual search and email threading both rely on artificial intelligence (AI).
While the goals of de-dupe, email threading, and concept indexing are all to identify groups of documents, whether exact, similar, or inclusive, the technologies used to calculate and interpret the results are significantly different.
De-Duplication in Practice
The de-dupe method applied is an essential question to ask before searching. During processing, it’s key to de-dupe so that only one copy of an exact duplicate is searched. Whether the duplicate copies come from alternate custodians or sources, providing only one exact copy for threading, conceptual analysis, and search is important. In the past, all dupes were loaded to a system, and a filtering technique was used so only the primaries were indexed, threaded, or searched by concept. Now, most tools discard duplicates and update fields on the retained copy to indicate that the document had duplicates, recording the other custodians and similar details.
From a cost perspective, loading duplicates in a system that charges by the GB means you may be paying twice for the same document. Another reason to use exact de-dupe is to achieve the same results regardless of platform. Since each platform may have a unique algorithm to calculate email threading or concept search, differences may occur if a situation arises where alternate platforms are used. For instance, if two loose files share the same MD5 hash, most platforms would detect those as exact dupes regardless of their de-dupe technology. However, two emails grouped in the same thread by one platform based on metadata may be interpreted differently by another analytics engine and placed in different threads.
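A minimal sketch of hash-based exact de-dupe, assuming each loose file is available as raw bytes. The `dedupe` helper, the tuple layout, and the sample files are hypothetical and only illustrate the idea: one primary is kept per MD5 hash, and every custodian who held a copy is recorded against it.

```python
import hashlib
from collections import defaultdict


def md5_of(content: bytes) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(content).hexdigest()


def dedupe(files):
    """Keep one primary per hash and record every custodian holding a copy.

    `files` is a list of (custodian, name, content_bytes) tuples
    (a hypothetical layout chosen for this example).
    """
    primaries = {}                   # hash -> (custodian, name) of the first copy seen
    custodians = defaultdict(list)   # hash -> all custodians holding that content
    for custodian, name, content in files:
        digest = md5_of(content)
        primaries.setdefault(digest, (custodian, name))
        custodians[digest].append(custodian)
    return primaries, dict(custodians)


files = [
    ("Mary", "budget.xlsx", b"Q1 budget figures"),
    ("Joe", "budget_copy.xlsx", b"Q1 budget figures"),
    ("Sue", "notes.txt", b"Meeting notes"),
]
primaries, custodians = dedupe(files)
print(len(primaries), "unique documents out of", len(files))
```

Because the hash depends only on the bytes, any platform that computes MD5 over the same content should reach the same grouping, which is the repeatability the paragraph above describes.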
The goal of email threading has evolved to detect if all content in one part of an email thread is included in a later part of that thread. This identifies the later email as inclusive, and the earlier email can be ignored since all its content is contained elsewhere. For example, an email with an attachment was sent by Mary to Joe, and he replied to her. The reply would not contain all of the original content because it would be missing the attachment. If, in the previous scenario, Joe received an email from Mary and forwarded it to Sue, that forward would contain all the original content because it would include the attachment.
These common email operations do not necessarily complicate the email threading calculation. When a user modifies or removes the original content of a received email, however, the system should detect that the later email in the thread doesn’t include all of the original content, such as when someone replies to a numbered list “in line” by putting their comments in the original author’s list.
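The Mary, Joe, and Sue scenarios can be sketched as a simple containment check. This is an illustration under loud assumptions, not how any particular threading engine works: emails are hypothetical dicts with `body` and `attachments` keys, quoted text is recognized only by a leading `>`, and attachments are compared by name rather than by content.

```python
def normalize(body: str) -> list[str]:
    """Strip quote markers and surrounding whitespace; drop empty lines."""
    lines = []
    for line in body.splitlines():
        cleaned = line.lstrip("> ").strip()
        if cleaned:
            lines.append(cleaned)
    return lines


def contains_all(later: dict, earlier: dict) -> bool:
    """True if `later` carries every body line and attachment of `earlier`."""
    later_lines = set(normalize(later["body"]))
    body_ok = all(line in later_lines for line in normalize(earlier["body"]))
    attachments_ok = set(earlier["attachments"]) <= set(later["attachments"])
    return body_ok and attachments_ok


original = {
    "body": "Please review the attached plan.\nThe deadline is Friday.",
    "attachments": ["plan.pdf"],
}
# Joe replies: the text is quoted, but the attachment is dropped.
reply = {
    "body": "Looks good.\n> Please review the attached plan.\n> The deadline is Friday.",
    "attachments": [],
}
# Joe forwards to Sue: text and attachment both travel along.
forward = {
    "body": "FYI\n> Please review the attached plan.\n> The deadline is Friday.",
    "attachments": ["plan.pdf"],
}
# A sender deletes part of the quoted original before sending.
trimmed = {
    "body": "Agreed.\n> Please review the attached plan.",
    "attachments": ["plan.pdf"],
}

print(contains_all(reply, original))    # False: attachment missing
print(contains_all(forward, original))  # True: forward is inclusive of the original
print(contains_all(trimmed, original))  # False: original text was removed
```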
In all these examples, the system should be able to calculate a thread and a tree showing the lineage of an email and its various branches through the use of metadata. Some tools may even be able to detect gaps in the email tree where emails are missing.
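As a rough sketch of building a tree from metadata and spotting gaps, assume each email carries an `id` and a `parent` field. Both field names are hypothetical, playing the role that the Message-ID and In-Reply-To headers play in real mail. A parent that is referenced but not present in the collection is reported as a missing email.

```python
from collections import defaultdict


def build_tree(emails):
    """Return (children, roots, missing) from parent/child metadata.

    children: parent id -> list of child ids
    roots:    ids of emails that start a thread
    missing:  parent ids that are referenced but not in the collection
    """
    known = {e["id"] for e in emails}
    children = defaultdict(list)
    roots, missing = [], set()
    for e in emails:
        parent = e["parent"]
        if parent is None:
            roots.append(e["id"])
        else:
            children[parent].append(e["id"])
            if parent not in known:
                missing.add(parent)
    return dict(children), roots, missing


emails = [
    {"id": "m1", "parent": None},   # Mary's original
    {"id": "m2", "parent": "m1"},   # Joe's reply
    {"id": "m3", "parent": "m1"},   # Joe's forward to Sue (a second branch)
    {"id": "m5", "parent": "m4"},   # reply to an email that was never collected
]
children, roots, missing = build_tree(emails)
print(children)  # branches under each parent
print(missing)   # gaps in the tree
```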
Search Improvements Using Threading
Using both the tree ID and the ID of inclusive emails can enhance the ability to locate documents or reduce the review burden. As we indicated in the previous blog post, it’s important to know whether you’re using search to try and narrow or expand your results.
The simplest way threading improves search results is by limiting a search to inclusive emails, which avoids multiple hits within a thread for text that originated in a single email. Imagine that the initial email in a thread of ten contains the keyword “John,” but none of the later authors use that term in their own replies. A search of the entire thread would still return every email: even though John appeared only in the first message, that message is quoted in each reply, so a reviewer would see hits in every subsequent email. Limiting the search to inclusives would mean you only have to review one email. This does not work for date filtering, however, because multiple dates are represented in an email thread.
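The ten-email “John” thread can be mocked up to show the difference, assuming each reply quotes everything before it and that the platform has already flagged the last email as inclusive. The `search` function and the `inclusive` field are hypothetical stand-ins for whatever the review tool exposes.

```python
def search(emails, term, inclusive_only=False):
    """Return ids of emails containing `term`, optionally limited to inclusives."""
    term = term.lower()
    return [
        e["id"]
        for e in emails
        if term in e["text"].lower() and (e["inclusive"] or not inclusive_only)
    ]


# Build a thread of ten in which every reply quotes the full text so far.
thread = []
text = "John will send the contract."
for i in range(1, 11):
    if i > 1:
        text = f"Reply {i}.\n> " + text.replace("\n", "\n> ")
    thread.append({"id": i, "text": text, "inclusive": i == 10})

all_hits = search(thread, "john")
inclusive_hits = search(thread, "john", inclusive_only=True)
print(len(all_hits), "hits across the thread;", len(inclusive_hits), "inclusive hit")
```

The keyword was typed once, yet the unrestricted search returns all ten emails because of the quoted text. Restricting to inclusives leaves a single email to review.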
Using the other output of email threading, the tree, enables the coding of entire threads. Combining the two outputs allows discrete emails that were ignorable for the purpose of review to be produced without a reviewer having to put eyes on them. For example, if a reviewer flags an inclusive email as responsive and not privileged, the earlier emails whose content it fully contains can be coded the same way and produced without individual review.
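One way to picture that propagation is below. It assumes the threading step has already recorded, for each non-inclusive email, which inclusive email contains it. The `covered_by` field and the coding labels are hypothetical.

```python
def propagate_coding(emails, reviewed):
    """Copy a reviewer's coding of inclusive emails onto the emails they contain.

    emails:   dicts with "id" and "covered_by" (id of the containing inclusive, or None)
    reviewed: inclusive email id -> coding decision made by a reviewer
    """
    result = dict(reviewed)
    for e in emails:
        source = e["covered_by"]
        if source is not None and source in reviewed:
            result[e["id"]] = reviewed[source]
    return result


emails = [
    {"id": "m1", "covered_by": "m3"},   # original, fully quoted in m3
    {"id": "m2", "covered_by": "m3"},   # first reply, fully quoted in m3
    {"id": "m3", "covered_by": None},   # the inclusive email a reviewer looked at
    {"id": "m4", "covered_by": "m5"},   # other branch, inclusive m5 not yet reviewed
]
reviewed = {"m3": "responsive, not privileged"}
coding = propagate_coding(emails, reviewed)
print(coding)
```

Emails on a branch whose inclusive has not been reviewed (`m4` here) are left uncoded rather than guessed at.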
Concept search relies on technology that groups items together by their textual content. In most cases, a byproduct of the indexing behind concept search is the calculation of near-duplicates. Near-dupes are like exact dupes except that they differ by one or more indexable characters or words, and a similarity percentage is often provided. Near-dupes of 95% or higher can reasonably be assumed to be versions of the same document. But the optimal percentage threshold to use as a cutoff for ignoring near-dupe items depends on the technology and the length of the documents. The longer the document, the higher the percentage that may be required. This may take practice and familiarity with a particular eDiscovery tool’s decisions regarding near-dupe identification.
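To see why document length matters for the threshold, here is a small sketch that scores similarity on word sequences with Python's standard `difflib.SequenceMatcher`. Commercial tools use their own near-dupe algorithms, so the percentages are illustrative only. The `make_doc` helper generates artificial documents of unique tokens and is purely hypothetical.

```python
from difflib import SequenceMatcher


def near_dupe_percent(doc_a: str, doc_b: str) -> float:
    """Similarity of two documents as a percentage, compared word by word."""
    a, b = doc_a.lower().split(), doc_b.lower().split()
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def make_doc(n_words: int, changed: bool = False) -> str:
    """Build an artificial document of n unique words; optionally alter one word."""
    words = [f"w{i}" for i in range(n_words)]
    if changed:
        words[n_words // 2] = "edited"
    return " ".join(words)


# The same one-word edit, applied to a short and a long document.
short_score = near_dupe_percent(make_doc(10), make_doc(10, changed=True))
long_score = near_dupe_percent(make_doc(100), make_doc(100, changed=True))
print(f"10-word document:  {short_score:.0f}% similar")
print(f"100-word document: {long_score:.0f}% similar")
```

A single changed word drops the ten-word document to 90% but leaves the hundred-word document at 99%, so a fixed 95% cutoff treats the two edits very differently.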
Conceptual indexing takes this near-dupe idea one step further, identifying documents that have the same ideas but may not share similar vocabulary. For instance, one document may talk about the rules of the game of soccer (the name of that sport in America), whereas another is talking about the same rules but calling it football (as it’s called in the rest of the world). These documents would not be identified as near-dupes but would be identified as conceptually similar. By contrast, a document describing the rules of American football and a document describing football as it’s known across the globe would not be identified as conceptually similar, even though they share a name.
Note that in most conceptual indexing, a document is only included in the conceptual group where it is most similar to the rest of the documents; it would not be a part of several groups. Many platforms apply cluster visualization to conceptual groups, which presents larger groupings as more abstract ideas.
Search Improvements Using Concept Searching
Once documents are in near-dupe and conceptual groups, search can be enhanced in a few ways. The most obvious is to continue reducing the search set by eliminating near-dupes that share a high percentage. This is much like ignoring emails by identifying their inclusives. When searching the primary near-duplicates in a near-duped inclusive email set, the hits can be divided into the number of different concepts vs. the discrete number of documents with responsive keywords.
For instance, if the reduced set contains 10,000 items and a search term only returns a dozen items, it’s fair to say that’s a targeted search. If it returned 20% of the set, the search might need to be refined, or the data is simply rich with that term. Rather than just identifying the number of hits, this type of searching identifies how many types of documents exist within the search hit set. Once the sets of documents to be reviewed are identified, it’s possible to review all those in the concept groups where the hits exist. This allows the identification of documents that are relevant to a search but don’t contain a particular keyword. In the earlier soccer example, if a reviewer searched for the word soccer, it would not return the rulebook where the game was called football. But reviewing the whole conceptual group would lead the reviewer to understand that it’s also called football. Every industry has acronyms, colloquialisms, and specific vocabulary; if a user is not knowledgeable about them, concept search can help.
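A small sketch of both ideas follows: measuring how targeted a search is, then widening review to the concept groups that contain hits. It assumes each document already carries a `concept` label assigned by the platform. The labels, field names, and sample texts are hypothetical.

```python
from collections import Counter


def summarize_hits(docs, term):
    """Return (hits, hit_rate, concepts) for a keyword search over `docs`."""
    hits = [d for d in docs if term.lower() in d["text"].lower()]
    hit_rate = len(hits) / len(docs) if docs else 0.0
    concepts = Counter(d["concept"] for d in hits)  # hits per concept group
    return hits, hit_rate, concepts


def expand_to_concept_groups(docs, hits):
    """Return every document in any concept group that contains a hit."""
    groups = {d["concept"] for d in hits}
    return [d for d in docs if d["concept"] in groups]


docs = [
    {"id": 1, "concept": "rules of the game", "text": "Soccer rules: eleven players per side."},
    {"id": 2, "concept": "rules of the game", "text": "Football rules: eleven players per side."},
    {"id": 3, "concept": "billing", "text": "Invoice for stadium rental."},
]

hits, hit_rate, concepts = summarize_hits(docs, "soccer")
expanded = expand_to_concept_groups(docs, hits)
print([d["id"] for d in hits])      # keyword hits only
print([d["id"] for d in expanded])  # hits plus the rest of their concept group
```

The keyword search finds only the soccer rulebook, while expanding to its concept group also surfaces the rulebook that calls the game football.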
The other major benefit of conceptual indexing is the ability to search by concepts rather than keywords. In some systems, this is called concept search; in others, it is called exemplar searching. In both cases, a reviewer provides phrases or paragraphs and asks the system to identify documents conceptually similar to what has been provided. The more text provided, the fewer but more conceptually similar the returned documents should be.
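The shape of an exemplar search can be sketched as ranking documents by similarity to a passage the reviewer supplies. The version below uses plain word-count vectors and cosine similarity, which is a deliberate simplification: it only rewards shared words, whereas real concept engines are built to match ideas across different vocabulary. All names and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Represent text as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def exemplar_search(docs: dict, exemplar: str, top_n: int = 2) -> list:
    """Return ids of the documents most similar to the exemplar passage."""
    query = vectorize(exemplar)
    scored = [(cosine(query, vectorize(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]


docs = {
    "rules": "a player may not touch the ball with a hand",
    "invoice": "payment is due within thirty days",
    "memo": "the referee may stop play when the ball leaves the field",
}
results = exemplar_search(docs, "the referee stops play if the ball touches a hand")
print(results)
```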
A final aspect of conceptual search and indexing revolves around instances where an eDiscovery team may have no idea what they are looking for. Clustering helps clarify what’s in the set. First, documents are randomly sampled out of every conceptual group and reviewed. Then, concept groups that have responsive documents in the sample set are reviewed in their entirety to identify additional responsive documents. This common approach is actually the beginning of the workflow for most technology-assisted review (TAR) or predictive coding engines on the market. This technique is used by plaintiffs and defendants alike when they have a set of documents with unknown contents.
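The sample-then-expand workflow can be sketched as follows, assuming documents are already clustered and that a reviewer's decision is available as a function. The cluster names, document ids, and the `is_responsive` stand-in are all hypothetical.

```python
import random


def sample_clusters(clusters, per_cluster, seed=0):
    """Draw a random sample of document ids from every concept cluster."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return {
        name: rng.sample(ids, min(per_cluster, len(ids)))
        for name, ids in clusters.items()
    }


def clusters_to_review(samples, is_responsive):
    """Return clusters whose sample contained at least one responsive document."""
    return [
        name for name, ids in samples.items()
        if any(is_responsive(doc_id) for doc_id in ids)
    ]


clusters = {
    "pricing discussions": [1, 2, 3, 4],
    "lunch plans": [5, 6, 7],
}
responsive_ids = {1, 2, 3, 4}  # stand-in for a human reviewer's calls

samples = sample_clusters(clusters, per_cluster=2)
to_review = clusters_to_review(samples, lambda doc_id: doc_id in responsive_ids)
print(to_review)
```

Only the clusters named in `to_review` would then be reviewed in full. The rest are set aside on the strength of their samples, so the sample size per cluster is the main judgment call.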
Keyword search and date filtering are ubiquitous and understandable to most users, and they generally produce the same results across platforms. Threading and concept searching can enhance and advance keyword search and filtering to improve overall search efficiency and effectiveness. Artificial intelligence-based email thread identification and conceptual grouping are not new ideas, but they have become simpler to use and easier to understand in modern eDiscovery platforms. Combining all these search technologies is the most effective way to reduce the review burden and locate the documents most relevant to your matter.