Executive Summary
- When estimating recall (the percentage of responsive documents identified by a review process) in a TAR 2.0 document review, the only sample used is often an elusion sample (a sample of unreviewed documents, i.e., the null set).
- While the size of an elusion sample can be set to achieve a specified degree of statistical confidence, e.g., a 95% confidence level with a 5% margin of error for the estimate, reviewing an elusion set alone can overestimate recall.
- A better way to estimate TAR 2.0 recall is to draw samples not only from the null set but also from the documents that have been reviewed and coded as responsive and from those that have been reviewed and coded as non-responsive.
- AI[1] may better enable requesting parties to demonstrate the inadequacy, i.e., unreasonableness, of productions. Such inadequacies are more likely in the absence of adequate validation, such as when only an elusion sample was taken.
TAR 2.0 – the Reigning Champion of Production Review
TAR 2.0 is likely the most widely used analytic technology for reviewing large document collections for production (although AI may eventually overtake it, at least to the extent that parties are willing to produce documents without any further review other than for privilege).
In general terms,[2] a TAR 2.0 (a/k/a CAL, continuous active learning) process, such as Relativity Active Learning, builds a predictive model that scores documents in a collection based on how likely they are to be responsive, as determined by the model’s training. Once training of the predictive model has started, the TAR platform queues up documents for human reviewers based upon those scores, prioritizing review of the highest-scoring unreviewed documents.
Human reviewers code, i.e., classify, each presented document as responsive or non-responsive, and those coding decisions are fed back into the model, which iteratively rescores the documents in the document collection to which TAR is being applied.
Training and review continue until a stopping point is reached, such as a substantial drop in the percentage of responsive documents served to reviewers, after which validation is performed.
By the end of a TAR 2.0 process, every document produced (or withheld for privilege) will have been reviewed and classified by a human reviewer as responsive (“reviewed as responsive”) or be a family member of a document classified as responsive.
The balance of the collection will contain two classes of documents:
(1) documents that were presented to human reviewers, reviewed, and coded as non-responsive by them (“reviewed as non-responsive”), and
(2) documents that were never served to the human reviewers because the TAR 2.0 platform’s predictive model deemed them non-responsive when training stopped (“unreviewed documents”, sometimes called the “null set”).
Elusion Sampling Alone Can Fail to Accurately Estimate Recall
From the perspective of a requesting party, the most crucial validation metric is recall, the percentage of responsive documents identified through the producing party’s review process.[3]
Frequently, for TAR 2.0 productions, recall for the TAR 2.0 process has been estimated using only the count of unreviewed documents and a sample from the null set, known as an elusion sample.
When the only sample taken is an elusion sample, recall is calculated by dividing the number of documents reviewed as responsive by the sum of those reviewed as responsive and the estimated number of responsive documents in the unreviewed documents. The estimated number of unreviewed responsive documents is calculated by multiplying the number of unreviewed documents by the percentage of responsive documents found in the elusion sample.
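The elusion-only calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual implementation: the function and parameter names are hypothetical, and the inputs in the usage lines are the figures from the hypothetical example discussed later in this article.

```python
def elusion_only_recall(reviewed_responsive, unreviewed_count,
                        sample_size, sample_responsive):
    """Estimate recall from an elusion sample alone.

    Assumes, as the elusion-only approach implicitly does, that every
    document reviewed as responsive was coded correctly.
    """
    # Elusion rate: share of the elusion sample found to be responsive.
    elusion_rate = sample_responsive / sample_size
    # Extrapolate that rate to the whole null set.
    est_unreviewed_responsive = unreviewed_count * elusion_rate
    return reviewed_responsive / (reviewed_responsive + est_unreviewed_responsive)


recall = elusion_only_recall(
    reviewed_responsive=250_000,
    unreviewed_count=4_000_000,
    sample_size=2_000,
    sample_responsive=40,
)
print(f"Elusion-only recall estimate: {recall:.0%}")  # prints 76%
```

Note that nothing in this calculation tests whether the documents already reviewed were coded correctly; the only sampled quantity is the elusion rate.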

This entire elusion validation edifice rests on the foundation of two often unstated or unexamined assumptions, viz., that (1) all documents classified as responsive by the human reviewers during the review process were accurately classified as responsive, and (2) all documents classified as non-responsive by the human reviewers during the review process were accurately classified as non-responsive.
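Both assumptions are problematic: there will virtually always be some non-responsive documents classified as responsive and some responsive documents classified as non-responsive. Ignoring either type of misclassification inflates the estimated recall: the first overstates the number of responsive documents found, and the second understates the number of responsive documents missed.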
Sampling from the Full Population Surfaces Weaknesses with an Elusion Sampling Only Validation Approach
If such misclassifications are not estimated and considered, the connotations of statistical rigor and accuracy conjured up by the confidence level and margin of error of an elusion sample are questionable – and perhaps even an illusion, depending on the level of erroneous coding in the original review.
Consider this hypothetical example: a population of 5,000,000 documents, of which 250,000 had been reviewed as responsive, 750,000 had been reviewed as non-responsive, and 4,000,000 (the null set) had not been reviewed.
| | Reviewed as Responsive | Reviewed as Non-Responsive | Unreviewed | Totals |
| --- | --- | --- | --- | --- |
| Population size | 250,000 | 750,000 | 4,000,000 | 5,000,000 |
| Sample size | 500 | 500 | 2,000 | 3,000 |
| Validated as responsive | 450 | 200 | 40 | 690 |
| Estimated total responsive document count | 225,000 | 300,000 | 80,000 | 605,000 |
Using the counts above, if only the null set was sampled in validation, the estimated recall would be an arguably respectable 76%, calculated as 250,000 (documents reviewed as responsive) divided by 330,000 (the sum of 250,000 documents reviewed as responsive and 80,000 estimated unreviewed responsive documents).
But the picture changes when the documents reviewed as responsive and those reviewed as non-responsive are also sampled and included in this hypothetical recall estimate, i.e., when a stratified sample is used. (The landmark In Re: Broiler Chicken ESI order, authored by Dr. Maura Grossman as a Special Master,[4] used the same sample sizes from the same three sub-collections as the hypothetical above.) With stratified sampling, the estimated recall is more than halved, dropping to a plainly inadequate 37%. This is calculated by dividing 225,000 (the estimated responsive portion of the documents reviewed as responsive) by 605,000, the sum of 225,000 (the estimated responsive portion of the documents reviewed as responsive), 300,000 (the estimated responsive portion of documents reviewed as non-responsive), and 80,000 (the estimated number of unreviewed responsive documents).
Such a drastic difference in these recall estimates may result from such factors as:
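The stratified calculation can be sketched as follows, using the sample counts from the hypothetical table above. This is an illustrative sketch only: the function names and the dictionary layout are assumptions made for this example, and it computes point estimates without confidence intervals.

```python
# Each stratum: (population size, sample size, sample documents validated as responsive).
# Figures are those of the hypothetical example in this article.
strata = {
    "reviewed_responsive": (250_000, 500, 450),
    "reviewed_non_responsive": (750_000, 500, 200),
    "unreviewed": (4_000_000, 2_000, 40),
}


def estimated_responsive(population, sample_size, sample_responsive):
    """Extrapolate a stratum's sample responsiveness rate to its population."""
    return population * sample_responsive / sample_size


def stratified_recall(strata):
    """Estimated truly responsive documents among those coded responsive,
    divided by the estimated responsive documents across all three strata."""
    estimates = {name: estimated_responsive(*counts) for name, counts in strata.items()}
    return estimates["reviewed_responsive"] / sum(estimates.values())


recall = stratified_recall(strata)
print(f"Stratified recall estimate: {recall:.0%}")  # prints 37%
```

Unlike the elusion-only calculation, both the numerator and the denominator here are built from sampled estimates, so miscoding in either reviewed stratum shows up in the result.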
- changes in the coding instructions given to human reviewers over the course of a TAR 2.0 review,
- applying unchanged coding instructions to documents added to the TAR review which were substantially dissimilar to the documents originally in the collection,
- reviewer turnover, and
- bias engendered when validation reviewers know that all the documents that they are reviewing were originally deemed to be non-responsive.
A 2024 study by Dr. Grossman and four colleagues (the “Comparison Study”),[5] based on a real-world review of a 169,000-document subset of a collection of over 300 million documents, compared various metrics, including recall, under two approaches: a stratified validation like the one in the example above (called a Confusion test in the study), and a recall estimate derived from an elusion sample alone, a methodology provided by another platform, which, while unnamed in the paper, was clearly Relativity. The comparison showed that Relativity’s elusion-based methodology overestimated recall in some instances by 29 percentage points (65% vs 94%), 20 points (69% vs 89%), 16 points (24% vs 40%), 25 points (42% vs 67%), and 36 points (55% vs 91%). Note that in some of these instances, the more accurate estimated recall calculated using stratified sampling was simply too low to be acceptable.
According to the Comparison Study, the “critical difference that renders [stratified validation sampling] more reliable and accurate than an Elusion test is that it examines predicted and actual results, which incorporate false negatives and false positives into recall calculations[6], whereas the Elusion test assumes that all documents marked as relevant on the first pass are accurately coded, which is pure fiction.” (Emphasis added.)
A June 2026 study by Redgrave LLP[7] comparing recall and precision between Relativity’s AI (aiR for Review) and TAR 2.0 (Relativity Active Learning) platforms in a low-richness population in a complex federal regulatory matter found that:
The first-pass RAL workflow’s TAR tool reported 100% recall based on an elusion sample drawn from its own discard pile. [The aiR for Review process] measured 64% recall through a blinded expert review of a random sample drawn from the full population. The figures are not contradictory; they answer different questions.
The TAR tool’s number comes from sampling the algorithm’s discard pile [the null set] and checking whether any of the discarded documents were actually responsive. None were, so the tool reported 100% recall—an accurate statement about the discards. Our 64% comes from a different vantage point: a blinded review by the subject-matter expert of a random sample of every document in the collection, regardless of where the workflow had routed it. It asks how many of all responsive documents the workflow surfaced. The two diverge because the discard pile is not the only place a responsive document can be lost. In an active-learning workflow, the algorithm presents documents to contract reviewers, who code each as Responsive or Not Responsive; the algorithm then trains on those codings. When a reviewer wrongly codes a truly responsive document as Not Responsive, the document drops out of the production set, and the algorithm treats the label as ground truth for its next round of training. Elusion sampling cannot detect those reviewer-caused misses because it samples only from the algorithm’s discards. The 64% population recall therefore reflects the combined performance of the contract-reviewer team, vendor QC, and calibration process—not a property of the algorithm in isolation.[8]
Uber MDL and Insulin MDL Courts Take Note and Reject an Elusion Sample Only Approach
In In Re: Uber Technologies, Inc. Passenger Sexual Assault Litigation,[9] the court rejected Defendant’s proposal to use only an elusion sample to validate TAR, instead requiring samples of the same size from the same three populations as set out in the hypothetical above.[10]
Likewise, in In Re: Insulin Pricing Litigation,[11] Magistrate Judge Singh mandated the use of a stratified sampling approach over Defendant’s proposal based on an elusion sample:
“On this record, the Court cannot find that [Defendant’s] proposal is sufficiently reasonable to validate the use of its TAR model. To be clear, the Court is reluctant to force a responding party to adopt validation metrics imposed by a requesting party; the premise that a responding party is best equipped to determine its search methodology remains true even as technology evolves. Nevertheless, [Defendant] must still satisfy the Court that its approach to TAR validation is reasonable. To do so, the case law and experts in the industry appear to recommend, if not require, a statistically sound methodology. Here, on this limited record, [Defendant] has not met that burden. Rather, the weight of the available authority raises concerns that adopting [Defendants’] proposed methodology would result in opaque and potentially unreliable recall calculations.
“Accordingly, [Defendant] will validate their TAR model consistent with the validation process and recall calculation ordered in in re Broiler Chicken, 2018 WL 1146371, including a sampling from the full TAR document universe.”[12]
AI May Revolutionize Receiving Party Evaluations of Document Productions
Decades ago, as a litigation associate at Cravath, Swaine & Moore, I worked on two major IBM antitrust cases. One was the U.S. government’s antitrust suit accusing IBM, inter alia, of monopolizing the support and leasing of IBM’s revolutionary System/360 line of computers. The other was a copycat private action brought by Greyhound Leasing, one of several leasing companies that bought System/360 computers at IBM’s retail prices and then leased them at prices lower than IBM’s. In essence, these companies were betting that they understood the relationship between IBM’s own retail and leasing prices, which were based on the estimated useful life of the computers, better than IBM did. As it turned out, they didn’t, and they then sued IBM, tracking the government’s allegations, for their losses.
The government’s case threatened IBM’s very existence, and accordingly, IBM waged total war against it, sparing no expense in its defense. IBM deployed hundreds of employees dedicated to supporting Cravath’s substantial team of attorneys.[13]
Under the direction of Cravath partner Tom Barr, the Cravath attorneys thoroughly researched every possible nook and cranny that they thought might affect the case.
Commenting on the scale of Cravath’s efforts and the attendant costs, Nicholas Katzenbach, the famed former Deputy Attorney General who served as IBM’s senior vice president and general counsel, was quoted as saying that he gave Tom Barr an unlimited budget and that Tom exceeded it.[14]
Such unlimited efforts have been beyond the reach of most litigants. But AI may change that, enabling teams of attorneys much smaller than the IBM and Cravath legions to perform the same types of comprehensive analysis at a fraction of the cost.
AI will enable much smaller teams of attorneys to analyze the document productions they receive with comparable rigor, examining not only what was produced but, perhaps more importantly, uncovering what wasn’t, to demonstrate the inadequacies of a defendant’s productions stemming from inadequate validation.
Depending on the capabilities of the AI used, such AI-powered analyses could include:
- analysis of production metadata to identify suspicious date range gaps within custodians and overall, volume anomalies compared to other similar custodians, missing metadata required under an ESI protocol, etc.;
- comparing RFPs against productions to identify requests that received no responsive documents, vague objections with no production, or categories where volume seems implausibly low;
- analyzing privilege logs for entries that don’t match the claimed privilege (e.g., no attorney on an “attorney-client” entry);
- identifying email participants who sent or received relevant communications but aren’t listed as custodians;
- identifying documents, reports, or studies referenced but not included in the production (“see attached,” “per my earlier memo,” “as the 2019 study showed”);
- identifying internal systems, databases, or repositories mentioned that weren’t addressed in the ESI protocol or identified elsewhere by the defendants; and
- comparing a defendant’s production against public records and third-party productions such as SEC filings, regulatory submissions, press releases, third-party subpoena returns, etc., to identify documents that should exist in both but appear only outside the production.
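As one illustration of the first item in the list above, the sketch below flags unusually long gaps between consecutive document dates for each custodian. It is a simplified example under stated assumptions: the input format (custodian and date pairs), the function name `find_date_gaps`, the 30-day threshold, and the sample data are all hypothetical. A real analysis would work from the production's load-file metadata and would need to account for factors such as holidays and each custodian's period of employment.

```python
from collections import defaultdict
from datetime import date, timedelta


def find_date_gaps(records, max_gap=timedelta(days=30)):
    """Return {custodian: [(gap_start, gap_end), ...]} for gaps longer than max_gap.

    `records` is an iterable of (custodian, document_date) pairs; both the
    input shape and the default threshold are illustrative assumptions.
    """
    dates_by_custodian = defaultdict(set)
    for custodian, doc_date in records:
        dates_by_custodian[custodian].add(doc_date)

    gaps = {}
    for custodian, dates in dates_by_custodian.items():
        ordered = sorted(dates)
        # Compare each date with the next one in chronological order.
        long_gaps = [
            (earlier, later)
            for earlier, later in zip(ordered, ordered[1:])
            if later - earlier > max_gap
        ]
        if long_gaps:
            gaps[custodian] = long_gaps
    return gaps


# Made-up sample data for demonstration only.
sample = [
    ("Custodian A", date(2019, 1, 5)),
    ("Custodian A", date(2019, 1, 20)),
    ("Custodian A", date(2019, 6, 1)),
    ("Custodian B", date(2019, 1, 10)),
    ("Custodian B", date(2019, 2, 1)),
]
print(find_date_gaps(sample))
```

With this sample data, only Custodian A is flagged, for the gap between January 20 and June 1, 2019. A flagged gap is a lead for follow-up questions, not proof that documents were withheld.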
In short, welcome to the future.
[1] “AI” as used in this article refers to both Gen AI and agentic AI.
[2] Every TAR platform has its own specific processes and idiosyncrasies; for example, particularly at the early stages of review, some platforms may also serve up for review documents with uncertain scoring, or first cluster documents by topic and then serve up documents from every topical cluster even if some have low scores.
[3] While recall can be estimated for a single stage of a producing party’s process, such as a TAR 2.0 review, multi-stage reviews, e.g., search terms followed by TAR, require that recall be estimated on an end-to-end basis, e.g., what percentage of the responsive documents in a document population remained after the population had been winnowed by search terms and then by TAR. In multi-stage review processes, end-to-end recall would be the product of the recall of each process. For example, if search terms had an estimated recall of 50% (a generous estimate) and TAR had an estimated recall of 70%, end-to-end recall would be a paltry 35% (50% x 70%).
[4] In re Broiler Chicken Antitrust Litigation, No. 1:16-cv-08637, 2018 WL 1146371 (N.D. Ill. Jan. 3, 2018), ECF 586 (01/03/18), p. 7.
[5] O’Halloran, T., McManus, B., Harbison, A., Grossman, M.R. and Cormack, G.V., 2024, March. Comparison of Tools and Methods for Technology-Assisted Review, in International conference on information management (pp. 106-126). Cham: Springer Nature Switzerland. An online prepublication version is available at https://ediscoverytoday.com/wp-content/uploads/2024/01/Comparison-of-Tools-and-Methods-for-Technology-Assisted-Review.pdf (most recently accessed on March 23, 2026).
[6] In this context, a false positive is a non-responsive document miscoded as responsive, and a false negative is a responsive document miscoded as non-responsive.
[7] Robert D. Keeling, Ray Mangum, F. Eli Nelson, & Kevin A. Reiss, Generative AI for Complex Document Review: A Comparative Evaluation Benchmarked Against Active Learning and an Independent Expert Reviewer, Redgrave LLP Working Paper 2026-01 (June 2026) (“Redgrave Study”).
[8] Redgrave Study, pp. 9-10 (emphasis added).
[9] In Re: Uber Technologies, Inc. Passenger Sexual Assault Litigation, MDL No. 3084 (N.D. Cal.).
[10] In Re: Uber Technologies, Inc. Passenger Sexual Assault Litigation, Case 3:23-md-03084-CRB, ECF 524 (05/03/2024).
[11] In Re: Insulin Pricing Litigation, MDL No. 3080 (D.N.J.).
[12] In Re: Insulin Pricing Litigation, Case 2:23-md-03080-BRM-RLS, ECF 503 (April 11, 2025), p. 13.
[13] One major task of the IBM employees was using Aquarius, IBM’s STAIRS database, arguably the start of modern litigation support technology and the inspiration for the Concordance litigation review program, to identify documents responsive to attorney requests.
[14] After 13 years of litigation, Cravath and IBM’s comprehensive effort succeeded when William Baxter, my antitrust professor at Stanford Law School, appointed as the Assistant Attorney General for Antitrust in the incoming Ronald Reagan administration, brought a fresh eye and rigorous economic analysis to the government’s case and simply dismissed it. (The Greyhound case was settled on the eve of trial.)

