Acknowledgments: I am grateful to Mary Mack, CEDS, CISSP, for her valuable feedback on an early draft of this piece.
At some point eDiscovery practitioners seem to have tacitly agreed that sample sizes for TAR validation testing should generally be taken using a 95% confidence level (CL) and, usually, a 2% to 5% confidence interval (CI). People often want to review the lowest defensible number of documents for validation purposes, so, with absolutely no regard to the specific parameters of their data, they cling to a 95% CL, 5% CI whether they’re sampling for a control set or for elusion testing. If this is your approach to TAR validation, please read to the end and reconsider!
This is a major problem in the industry. Over and over, I meet brilliant, competent experts who simply don’t understand what their preferred sample size really implies. I’ve even read white papers that belie their authors’ proclaimed expertise on validation.
I’m going to give some examples that should elucidate the consequences of the erroneous assumptions so many make, but first let’s touch on a Stats 101 overview of probability theory. It’s so important to fully grasp exactly what a given CL and CI are telling you about your estimates.
Confidence Level: In plain English, a confidence level is the probability that a sample is representative of your overall population. A 95% confidence level means that, on average, 19 out of 20 samples will give you accurate estimates. Keep in mind that you should expect 1 out of 20 samples to produce an estimate that misses the true value, meaning the true value falls outside your confidence interval.
Confidence Interval: Your confidence interval gives you a range that likely contains the true proportion in the overall population of documents from which you’re drawing your sample. For example, 80% recall with a 5% confidence interval means that the software probably identified 75% to 85% of the responsive documents in the database. It’s important to note that in this scenario, for any given confidence level, it’s just as likely that actual recall is 76% as it is that it’s 82%, or any other number in that range. I cannot emphasize this enough: you should not conclude that it is more likely that actual recall is 80%. With 95% confidence, it is just as likely to be 84% as 78%. For this reason, you should always make decisions about recall based on the lower bound of this range.
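To make that concrete, here’s a minimal simulation sketch in Python showing what the 95% confidence level actually promises. The 32% richness and 383-document sample are just the figures from the example below, and I’m using the simple normal-approximation interval purely to keep the code short (see the footnote on interval choice):

```python
import random

TRUE_RICHNESS = 0.32   # true proportion of responsive documents (figure from the example below)
SAMPLE_SIZE = 383      # the usual 95% CL / 5% CI sample size
TRIALS = 10_000
Z = 1.96               # z-score for a 95% confidence level

covered = 0
for _ in range(TRIALS):
    # Draw one random sample and compute the observed proportion
    hits = sum(random.random() < TRUE_RICHNESS for _ in range(SAMPLE_SIZE))
    p_hat = hits / SAMPLE_SIZE
    # Normal-approximation interval around the observed proportion
    margin = Z * (p_hat * (1 - p_hat) / SAMPLE_SIZE) ** 0.5
    if p_hat - margin <= TRUE_RICHNESS <= p_hat + margin:
        covered += 1

print(f"Samples whose interval contained the true richness: {covered / TRIALS:.1%}")  # ~95%
```

Roughly 1 in 20 of those simulated samples produces an interval that misses the truth entirely, which is exactly the risk the confidence level describes.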
In fact, I always try to give my clients ranges, minimums, or maximums. I rarely give only the statistic, because it is virtually meaningless without a confidence interval. It’s also worth noting that the closer your proportion is to 50%, the more documents you need to review for a precise range. But that’s another article for another day.
It’s crucial to remember that we’re calculating sample size to estimate a proportion. This means that we cannot estimate recall and precision at 95% CL, 5% CI using a 95% CL, 5% CI sample of all documents in the database. (Unless more than 95% of all documents in the database are responsive—but why would we ever use TAR under those circumstances?)
Because precision and recall are proportions of only the documents designated as responsive by either the software or a human reviewer, you need to make sure you have enough responsive documents in your random sample to calculate useful estimates of these validation metrics. That’s not too onerous if 30% to 34% of your documents are responsive, but things get tricky and inefficient when only 1% to 3% of all the documents are responsive.
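To put rough numbers on that, the back-of-the-envelope arithmetic looks like this (the 400-document sample is just an illustrative round number):

```python
sample_size = 400  # illustrative sample size

for richness in (0.32, 0.03, 0.01):
    expected = richness * sample_size
    print(f"At {richness:.0%} richness, expect roughly {expected:.0f} responsive documents in the sample")
```

A handful of responsive documents is nowhere near enough to say anything precise about recall or precision, as the second example below makes painfully clear.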
Consider the following examples to see how problematic these common misunderstandings can be.
Let’s imagine we’re starting with 100,000 unreviewed documents. The sampling feature of my review software gives me 383 documents for my control set using a 95% CL and 5% CI. The associate on the case reviews them with perfect accuracy and identifies 123 as responsive and 260 as non-responsive.
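In case you’re wondering where that 383 comes from, here’s a sketch of the standard sample-size formula for estimating a proportion, with a finite population correction and the worst-case 50% proportion assumed. It reproduces the number my software gave me, though I can’t promise every tool does the calculation exactly this way:

```python
import math

def sample_size(population, z=1.96, interval=0.05, p=0.5):
    """Sample size for estimating a proportion, with a finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / interval ** 2          # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # finite population correction

print(sample_size(100_000))  # 383
```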
Great! We can use our confidence interval to estimate that 27% to 37% of all documents in the database are responsive. Let’s train the TAR software and look at our recall and precision.
We find that the software categorized our control set such that we have:
|  | Coded responsive by the attorney | Coded not responsive by the attorney |
| --- | --- | --- |
| Categorized responsive by the software | 92 | 39 |
| Categorized not responsive by the software | 31 | 221 |
Remembering that recall and precision are calculated as:

Recall = documents correctly categorized as responsive ÷ all documents coded responsive by the attorney = 92 ÷ 123 ≈ 75%

Precision = documents correctly categorized as responsive ÷ all documents categorized as responsive by the software = 92 ÷ 131 ≈ 70%
Our statistics indicate 75% recall and 70% precision, so our sample size suggests we’re at 70% – 80% recall and 65% – 75% precision, right?
This is where I’ve seen a lot of very smart people go wrong.
Leaving aside questions about probability theory, which confidence interval formula to use, and mathematical proofs, our sample size and 32% rate of responsiveness do suggest that around 27% to 37% of all the documents are responsive. However, since our sample size for recall is 123, not 383, we can only estimate with 95% confidence that recall is anywhere between 66% and 82%. Remember that it’s just as likely to really be 67% as it is to be 79%. Are you comfortable defending the decision to produce only 2 out of every 3 responsive documents? Our precision estimate is also fairly wide at 62% to 78%.
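If you want to check ranges like these yourself, here’s a minimal sketch of the exact (Clopper-Pearson) interval using SciPy’s beta distribution; the endpoints will shift slightly depending on which interval you prefer (see the footnote):

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a proportion."""
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lower, upper

# Recall: 92 of the 123 documents the attorney coded responsive were caught by the software
print(clopper_pearson(92, 123))   # roughly (0.66, 0.82)

# Precision: 92 of the 131 documents the software categorized as responsive were correct
print(clopper_pearson(92, 131))   # roughly (0.62, 0.78)
```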
We’d have to triple the size of the control set, and we still might not have enough responsive documents to calculate a 5% confidence interval, but we should probably be in good shape if we include at least 1,425 documents total in the control set. While that’s quite a bit more than 383, at least it’s still a practical number. In this scenario it’s reasonable to conclude that a control set is appropriate for validation, but make sure it’s large enough and that you (or your data science consultant) are calculating the actual confidence intervals for your validation metrics.
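A rough way to sanity-check that figure, assuming we want on the order of 385 responsive documents for a 5% interval on recall and using the lower bound of our richness estimate:

```python
import math

responsive_needed = 385       # approximate 95% CL / 5% CI sample for recall
richness_lower_bound = 0.27   # lower bound of the 27% to 37% richness estimate

print(math.ceil(responsive_needed / richness_lower_bound))  # about 1,426 documents
```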
Now let’s imagine the same scenario, but this time the associate coded with perfect accuracy and found that 8 were responsive and 375 were not responsive.
She then trained the model and we’re looking at:
|  | Coded responsive by the attorney | Coded not responsive by the attorney |
| --- | --- | --- |
| Categorized responsive by the software | 6 | 3 |
| Categorized not responsive by the software | 2 | 372 |
Now recall and precision are calculated as:

Recall = 6 ÷ 8 = 75%

Precision = 6 ÷ 9 ≈ 67%
Same 75% recall and now 67% precision, but, with a 95% CL, our range is anywhere from 35% to 97% using a conservative estimate for recall, and our precision estimate is equally useless at 30% to 93%. We’d probably need around 9,500 to 38,000 documents in our control set to estimate these within our desired 5% interval!
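Reusing the clopper_pearson sketch from the first example (and keeping in mind that the exact figures depend on which interval you use):

```python
print(clopper_pearson(6, 8))    # recall: roughly (0.35, 0.97)
print(clopper_pearson(6, 9))    # precision: roughly (0.30, 0.93)

# Control-set size implied by the richness interval around 8 responsive out of 383
low, high = clopper_pearson(8, 383)   # roughly (0.009, 0.041)
print(385 / high, 385 / low)          # on the order of 9,000 to 43,000 documents
```

The control-set estimate comes out a bit wider here than the 9,500 to 38,000 above because different interval formulas give slightly different bounds at such a low richness; either way, the number is impractical.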
It rarely makes sense to include so many documents in a control set, so in situations like these I note the proportion of responsive documents to help guide workflow, then count on elusion testing for validation.
If you, like others, are now wondering if we can simply calculate recall and precision based on all documents designated as responsive by humans or the software, including ones used to train the model, I have more bad news.
Of course we can plug numbers into the numerator and denominator to come up with a number. In fact, I should admit that I absolutely do that to guide workflow when I’m dealing with a very low proportion of responsive documents—but it’s not an appropriate method for validation and you should not assume you hit your target recall if this is the recall estimate you’re using. Why? Because you’re probably going to have issues with what we call an overfitted model—a common problem with machine learning. Both precision and recall are likely to be exaggerated if you calculate them on documents that were used to train the model, because the model is too specific to the examples you used for training and not generalizable to documents that have not been reviewed.
The good news is that elusion testing gives you a reasonable way to estimate recall, but you need to be just as careful with confidence intervals as you decide how many documents to include in your elusion sample. I’ll walk you through that in the second part of this piece.
[i] I use Clopper-Pearson intervals for the conservative ranges given here, but I prefer Wilson score intervals. The numbers will differ if you’re using the normal approximation, and there are reasons not to use that method. Please contact me if you have any interest in discussing or debating what interval to use—coffee is absolutely on me if it means meeting someone else who enjoys thinking about this in their free time.
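For anyone who wants to compare the options before that coffee, statsmodels implements all three; here’s a quick sketch using the recall counts from the first example:

```python
from statsmodels.stats.proportion import proportion_confint

# 92 correctly categorized out of 123 attorney-responsive documents
for method in ("beta", "wilson", "normal"):   # "beta" is statsmodels' name for Clopper-Pearson
    low, high = proportion_confint(92, 123, alpha=0.05, method=method)
    print(f"{method:>7}: {low:.3f} - {high:.3f}")
```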