
Extract from Cristin Traylor’s article “How to Scale Defensible Generative AI Results for Document Review”
Just as with the technology-assisted review (TAR) methods that preceded it by two decades, two fundamental questions remain at the heart of any conversation about whether generative AI is a viable option for document review, whether for productions or other targeted document requests:
- First, is it accurate?
- And second, is it defensible?
When TAR first became available, the defensibility of its use in litigation was rigorously scrutinized and debated. But a series of judicial decisions firmly established that defensibility, thanks in large part to the validation metrics that practitioners used to measure the accuracy of its results.
These statistical metrics included:
- Recall: the percentage of all relevant documents in a population that the AI correctly predicted to be relevant. A higher value is better.
- Precision: the percentage of documents that the tool predicted to be relevant that are truly relevant. A higher value is better.
- Elusion: the percentage of documents predicted to be not relevant that are actually relevant. A lower value is better.
- Richness: the percentage of all documents in the collection that are relevant. Higher or lower values aren’t better or worse, but a lower value often requires larger sample sizes for validation testing and may result in a wider margin of error.
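
To make these definitions concrete, here is a minimal Python sketch that computes all four metrics from the counts in a validation sample. The counts are illustrative assumptions for this example, not figures from any actual review or from the article:

```python
# Hypothetical counts from a validation sample of 10,000 documents
# (illustrative assumptions only):
true_positives = 850    # predicted relevant, actually relevant
false_positives = 150   # predicted relevant, actually not relevant
false_negatives = 100   # predicted not relevant, actually relevant
true_negatives = 8900   # predicted not relevant, actually not relevant

total = true_positives + false_positives + false_negatives + true_negatives

# Recall: share of all truly relevant documents the tool found.
recall = true_positives / (true_positives + false_negatives)

# Precision: share of the tool's "relevant" predictions that are correct.
precision = true_positives / (true_positives + false_positives)

# Elusion: share of the predicted-not-relevant pile that is actually relevant.
elusion = false_negatives / (false_negatives + true_negatives)

# Richness: share of the whole sample that is relevant.
richness = (true_positives + false_negatives) / total

print(f"Recall:    {recall:.1%}")     # 89.5%
print(f"Precision: {precision:.1%}")  # 85.0%
print(f"Elusion:   {elusion:.1%}")    # 1.1%
print(f"Richness:  {richness:.1%}")   # 9.5%
```

Note how the low richness in this hypothetical (9.5%) illustrates the point above: a random validation sample from a collection like this contains relatively few relevant documents, so estimating recall or elusion with a tight margin of error requires a larger sample.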