Extract from John Tredennick and Thomas Gricks’ article “The Importance of Contextual Diversity in Technology Assisted Review”
How do you know what you don’t know? This is a classic problem when searching a large volume of documents in litigation or an investigation.
In a technology assisted review (TAR), a key concern for some is whether the algorithm has missed important relevant documents, especially those that you may know nothing about at the outset of the review. This is because most modern TAR systems focus exclusively on relevance feedback, which means that the system feeds you the unreviewed documents that are likely to be the most relevant because they are most like what you have already coded as relevant. In other words, what is highly ranked depends on the documents that were tagged previously.
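To make the relevance feedback loop concrete, here is a minimal sketch in Python. It is not how any particular TAR product works: it assumes a simple bag-of-words model and cosine similarity, and the function names (`vectorize`, `cosine`, `rank_unreviewed`) and sample documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_unreviewed(relevant_docs, unreviewed_docs):
    """Rank unreviewed documents by similarity to what was coded relevant.

    Assumption: the "model" is just the summed term counts of the
    relevant documents. Real TAR systems use richer classifiers.
    """
    profile = Counter()
    for doc in relevant_docs:
        profile.update(vectorize(doc))
    scored = [(cosine(profile, vectorize(doc)), doc) for doc in unreviewed_docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical documents already coded as relevant by the reviewer.
relevant = ["pricing agreement with competitor", "competitor pricing call notes"]

# Hypothetical unreviewed documents.
unreviewed = [
    "notes from pricing call",
    "lunch menu for friday",
    "offshore account wire transfer",
]

ranking = rank_unreviewed(relevant, unreviewed)
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

In this toy run, the document that shares vocabulary with the coded examples rises to the top. The wire transfer document scores exactly the same as the lunch menu, because it shares no terms with anything tagged so far. If it concerned a relevant topic no one had yet identified, a ranking built only on prior tags would give the reviewer no reason to look at it.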
When you train a TAR algorithm using documents with which you are already familiar, or documents you located using a focused keyword search, the algorithm in effect assumes you know the full scope of your review: the topics, concepts and themes to look for.
But what about other relevant documents you didn’t find? Maybe they arrived in a rolling collection. Or maybe they existed all along but no one knew to look for them. How would you find them based on your initial terms? When there are unexpected documents, concepts or terms in the collection, you could miss them simply because you don’t know to search for them.