Across the legal landscape, lawyers search for documents for many different reasons. TAR 1.0 systems were primarily used to classify large numbers of documents when lawyers were reviewing documents for production. But how can you use TAR for even more document review tasks?

Modern TAR technologies (TAR 2.0, based on the continuous active learning, or CAL, protocol) include the ability to deal with low richness, rolling and small collections, and flexible inputs in addition to vast improvements in speed. These improvements also allow TAR to be used effectively in many more document review workflows than traditional TAR 1.0 systems.

In addition to classification for production reviews, the types of review tasks lawyers typically face fall generally into two other categories:

  • Knowledge Generation. The goal here is learning what stories the documents can tell us and discovering information that could prove useful to our case. A common example of this is searching and reviewing documents received in a production from an opposing party or searching a collection for documents related to specific issues or deposition witnesses.
  • Protection. This is a higher level of scrutiny in which the purpose is to protect certain types of information from disclosure. The most common example is privilege review, but this also encompasses trade secrets and other forms of confidential, protected or even embarrassing information, such as personally identifiable information (PII) or confidential supervisory information (CSI).

Our goal is to get you thinking beyond outbound productions so you can take advantage of the many other things CAL can do.

In addition to outbound productions, Catalyst has effectively handled more than 350 diverse reviews in the past two years using Insight Predict (our TAR 2.0 technology) including investigations, opposing party reviews, depo prep and issue analysis, and privilege and privilege QC.

Here, we explore the various techniques for implementing a CAL review for investigations.

For any given task using Predict, including investigations, there are some fairly standard approaches that work well and take full advantage of the strengths of CAL and the two companion algorithms comprising Predict, contextual diversity and algorithmic QC. We discuss these more below.


Most investigations, whether they are internal investigations, regulatory investigations, or even investigations in anticipation of litigation, are true knowledge generation tasks. The primary objective is to find the critical documents on all the principal issues as quickly as possible. There is no need to find every document—just enough of the key documents to fully understand all of the underlying issues.

So, recall is not crucial; precision and coverage are. And unlike litigation, there are no fact-laced complaints or prescriptive requests for production to focus the search for pertinent documents. An investigation often begins with, at most, a handful of informative documents, and more often with nothing more than vague allegations of potentially actionable conduct.

Single seeds: The paucity of exemplar documents is not an impediment to an efficient and effective investigation review using CAL. Our studies have shown that even a single positive document can be used to quickly locate the majority of the pertinent documents in a collection.

Synthetic seeds: And that starting document may be a synthetic seed that doesn’t even exist in the subject collection. A synthetic seed is an exemplar document that is created from whole cloth to reflect the key language likely contained in the positive documents within the subject collection. And it can take pretty much any form. A synthetic seed can be a prose recitation of the facts that are expected to underlie the entire investigation. Or it can simply be a compilation of the keywords (or other features, such as bigrams and trigrams) that are likely to be contained in any positive documents. Whatever form it takes, the document is simply added to the subject collection and marked positive, and a CAL tool will immediately start to elevate similar (likely positive) documents to the top of the ranking for early review.
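As a rough illustration of the idea, and not a description of how Predict or any other CAL tool works internally, the sketch below ranks a toy collection by cosine similarity to a synthetic seed built from keywords. The documents, the seed text and the function names are all hypothetical, and plain term counts stand in for whatever features a real tool would use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_seed(seed_text, collection):
    """Return document ids ordered from most to least similar to the seed."""
    seed_vec = Counter(tokenize(seed_text))
    scores = {doc_id: cosine(seed_vec, Counter(tokenize(text)))
              for doc_id, text in collection.items()}
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection; none of these documents come from the article.
collection = {
    "doc1": "Lunch menu for the quarterly offsite",
    "doc2": "Wire the payment to the consultant before the tender closes",
    "doc3": "Please keep the consultant payment off the books",
    "doc4": "Updated parking instructions for visitors",
}

# Hypothetical synthetic seed: a compilation of keywords expected in positive documents.
synthetic_seed = "consultant payment off the books tender wire"

ranking = rank_by_seed(synthetic_seed, collection)
print(ranking)
```

In this toy example the two documents that share vocabulary with the seed rise to the top of the ranking, even though the seed itself is not part of the original collection.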

To illustrate how this can work, we simulated a small investigation to see how quickly Predict could locate the positive documents in a very sparse collection, using only a single document to initiate the ranking. The collection consisted of roughly 4,600 documents, of which only 55 were positive (for a richness of 1.2%). The yield curves for two simulated reviews using (1) a single positive document, and (2) a single synthetic seed are shown below.[1] As the chart illustrates, Predict was able to locate 65% of the positive documents after review of fewer than 130 documents in both cases. That equates to reviewing fewer than four documents to find each positive document (roughly 36 positives among fewer than 130 documents reviewed), and a review of less than 3% of the collection.[2]

While every collection differs, the utility of Predict and CAL for investigations is clear. Positive documents can be elevated for review very quickly, and with very little information or advance preparation.

There is an additional benefit to using CAL for investigations that is inherent to the operation of many CAL protocols. CAL is proficient at quickly elevating documents pertaining to the broad range of issues that may be subsumed within an investigation review, without having to focus on any given issue independently. That means that a CAL review will not only elevate positive documents for review quickly, but will do so across a broad spectrum of pertinent issues.

Aspectual Recall Simulations

We simulated this CAL characteristic as well, confirming a study that was first reported in 2015.[3] We evaluated a collection consisting of 521,669 documents that was nearly 42% rich. The collection had been evaluated for responsiveness, as well as 13 substantive issues, which themselves ranged in richness from 0.13% to 27.7%.

We simulated a responsiveness review, and simultaneously noted the discovery of documents relating to each issue, charting the results as a group of yield curves as shown below. This chart shows that a Predict review for generally positive documents (blue line) will quickly elevate documents covering a wide range of topics at issue (green lines). In this example, by the time the general review had reached 40% of the positive documents, documents for every related issue had been discovered and reviewed.
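One way to measure this kind of issue coverage from a finished (or simulated) review is sketched below. For each issue, it reports the overall responsive recall at the moment the first document tagged with that issue was reviewed. The function name, the issue labels and the toy data are hypothetical and are not taken from the simulation described above.

```python
def first_discovery_recall(review_order, responsive, issues):
    """For each issue, return overall responsive recall at the point where
    the first document tagged with that issue is reviewed.

    review_order: list of document ids in the order they were reviewed
    responsive:   set of ids coded responsive
    issues:       dict mapping issue name -> set of ids tagged with that issue
    """
    total = len(responsive)
    found = 0
    discovered = {}
    for doc_id in review_order:
        if doc_id in responsive:
            found += 1
        for issue, tagged in issues.items():
            # Record recall only the first time an issue is encountered.
            if issue not in discovered and doc_id in tagged:
                discovered[issue] = found / total
    return discovered

# Hypothetical toy review: eight documents, five responsive, three issues.
review_order = ["a", "b", "c", "d", "e", "f", "g", "h"]
responsive = {"a", "b", "d", "e", "g"}
issues = {
    "pricing": {"a", "e"},
    "bid rigging": {"d"},
    "retaliation": {"g"},
}

coverage = first_discovery_recall(review_order, responsive, issues)
print(coverage)
```

The largest value in the result is the overall recall at which every issue had been seen at least once, which is the kind of figure quoted above for the 13-issue collection.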

We saw the same results for documents that had been coded as “hot” (red line) in that collection. In conjunction with the general responsiveness review, we also noted the discovery of hot documents, which we again charted together as yield curves. Consistent with the ability of CAL to elevate documents relating to the various topics at issue in the review, CAL also quickly elevated the important, hot documents for early review and analysis.

The importance of using CAL for investigations to take advantage of this capability is again fairly obvious. The ability to get to the most critical documents first, and to see documents spanning all of the underlying issues, will make an investigation both quick and complete.

Contextual Diversity

One of the ancillary benefits of using Predict for an investigation is the ability to continuously search for unknown topics or concepts that might exist in the collection, through the continuous application of contextual diversity, one of the companion algorithms to Predict. Contextual diversity continuously ranks the entire collection by how much each document differs from what has already been seen and coded.

These “unknown” documents are continually batched for review to ensure an even deeper penetration into the collection than is otherwise attained by the general ability of CAL to ferret out all the peripheral underlying issues. And the ratio of contextual diversity documents in the review batches can be adjusted over time, as the focus of the review moves from a more complete understanding of the known issues, into an exploration of the unknown issues.
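The mixing of relevance-ranked and diversity-ranked documents in a batch can be sketched as follows. This is only an illustration under stated assumptions: novelty is approximated here as one minus the highest Jaccard similarity to any reviewed document, which is a stand-in and not Catalyst's contextual diversity algorithm, and every name and document below is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def novelty(doc_tokens, reviewed_tokens):
    """One minus the highest similarity to any already-reviewed document."""
    if not reviewed_tokens:
        return 1.0
    return 1.0 - max(jaccard(doc_tokens, r) for r in reviewed_tokens)

def build_batch(relevance_ranked, unreviewed, reviewed, batch_size, diversity_ratio):
    """Assemble a review batch from two sources.

    relevance_ranked: unreviewed ids, most likely positive first (assumed model output)
    unreviewed:       dict mapping id -> set of tokens
    reviewed:         list of token sets for documents already coded
    diversity_ratio:  fraction of the batch reserved for the most novel documents
    """
    n_diverse = round(batch_size * diversity_ratio)
    by_novelty = sorted(unreviewed,
                        key=lambda d: (-novelty(unreviewed[d], reviewed), d))
    diverse = by_novelty[:n_diverse]
    relevant = [d for d in relevance_ranked if d not in diverse]
    return relevant[:batch_size - n_diverse] + diverse

# Hypothetical data: one document already coded, four still unreviewed.
reviewed = [{"payment", "consultant"}]
unreviewed = {
    "d1": {"payment", "consultant", "invoice"},
    "d2": {"payment", "wire"},
    "d3": {"golf", "outing"},
    "d4": {"consultant", "contract"},
}
relevance_ranked = ["d1", "d4", "d2", "d3"]

early_batch = build_batch(relevance_ranked, unreviewed, reviewed, 2, 0.0)
later_batch = build_batch(relevance_ranked, unreviewed, reviewed, 2, 0.5)
print(early_batch, later_batch)
```

Raising `diversity_ratio` over the course of a review shifts each batch from documents that resemble known positives toward documents unlike anything coded so far.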

Proving a Negative

A corollary to the knowledge generation investigation task is the ability to use Predict and contextual diversity to demonstrate the absence of any documents relating to a particular inquiry—at least to a statistical degree of certainty. This would typically arise in situations such as Supplemental Second Requests, where the body of responsive documents has already been depleted. In that situation, rather than review the entire collection to find nothing, the objective is to review only a sufficient number of documents to adequately demonstrate that there simply are no more responsive documents. Certainly, Predict and contextual diversity are only two of the weapons in the arsenal needed to “prove a negative,” but they are critical to minimizing the review that will be necessary to achieve appropriate statistical levels.

Since proving a negative is a statistical undertaking, the first step is to set the statistical parameters (confidence level and confidence interval) that will support your conclusion that there are not enough documents in the collection to justify a full review. The confidence level and confidence interval will establish (1) the number of documents that will need to be reviewed and (2) the margin of error for the review. With the margin of error, the maximum number of positive documents in the collection can be estimated, and proportionality considerations can be quantified.

There is no hard and fast rule for setting the statistical parameters for the review, and there are really two approaches. First, setting the size of the ultimate review will establish the statistical parameters—that is, the confidence level and confidence interval. For example, reviewing 5,000 documents will, in every instance, ensure a margin of error of less than 1.4% at a 95% confidence level, and less than 1.9% at a 99% confidence level.
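The figures for a 5,000-document review can be checked with a short calculation. The sketch below assumes the usual normal approximation for a proportion and the worst case of 50% prevalence, which is why the bound holds whatever the actual richness turns out to be. It ignores any finite-population correction, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def worst_case_margin(sample_size, confidence):
    """Worst-case margin of error for a proportion (prevalence assumed to be 50%)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(0.25 / sample_size)

for confidence in (0.95, 0.99):
    margin = worst_case_margin(5000, confidence)
    print(f"{confidence:.0%} confidence: margin of error {margin:.2%}")
```

Under these assumptions the margins come out at about 1.39% and 1.82%, consistent with the "less than 1.4%" and "less than 1.9%" figures above.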

Alternatively, setting the confidence level and confidence interval will establish the size of the review. So, it will take a total of roughly 2,400 documents to ensure a maximum margin of error of 2% at a confidence level of 95%, while it would take roughly 4,200 documents to achieve that same margin of error at a 99% confidence level.[4] In either case, the relative cost and benefit of various sample sizes can be evaluated, and the number of documents to be reviewed can be negotiated and set accordingly.
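Going the other way, from a target margin of error to a review size, is the same formula rearranged. As before, this is a sketch that assumes the normal approximation, worst-case 50% prevalence and no finite-population correction; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence):
    """Smallest sample size whose worst-case margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * 0.25 / (margin * margin))

n_95 = required_sample_size(0.02, 0.95)
n_99 = required_sample_size(0.02, 0.99)
print(n_95, n_99)
```

Under these assumptions the results are close to the rounded figures quoted above: about 2,400 documents at 95% confidence and a little under 4,200 at 99%.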

Before implementing CAL, advanced analytics should be used to review between 20% and 30% of the total number of documents to be reviewed, in an effort to find positive documents. Every reasonable technique should be used: carefully crafted and targeted keyword searches; file type analyses; custodian and timeline analyses; communication analytics; random sampling; etc.

Since no analytics approach is likely to locate any responsive documents (because none are expected to exist in the collection), the entire review should focus on finding documents that are contextually close to being responsive. These “close” documents will eventually serve as the best available training examples for the CAL review.

Once the analytics review is complete, continuous active learning can be used to complete the remainder of the review. The CAL algorithm will efficiently analyze the entire collection to locate any documents that are contextually similar to the “close” documents located during the analytics review, and will continuously learn from every coding decision made along the way.
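The rank, review and relearn cycle can be sketched in a few lines. This is a deliberately simple stand-in: a term-weight model that adds one for every term in a positive document and subtracts one for every term in a negative document, which is an assumption made for illustration and not the algorithm Predict uses. The collection, the seed text and the `is_positive` reviewer function are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cal_review(collection, seed_text, is_positive, batch_size=1):
    """Simulate a continuous active learning loop and return the review order.

    collection:  dict mapping id -> document text
    seed_text:   text treated as a positive example to start the ranking
    is_positive: callable(id) -> bool, standing in for the human reviewer
    """
    weights = Counter(tokenize(seed_text))  # each seed term starts at +1
    unreviewed = {d: tokenize(t) for d, t in collection.items()}
    order = []
    while unreviewed:
        # Re-rank everything not yet reviewed using the current weights.
        ranked = sorted(
            unreviewed,
            key=lambda d: (-sum(weights[t] for t in unreviewed[d]), d),
        )
        for doc_id in ranked[:batch_size]:
            tokens = unreviewed.pop(doc_id)
            # Learn from the coding decision before the next ranking.
            delta = 1 if is_positive(doc_id) else -1
            for t in tokens:
                weights[t] += delta
            order.append(doc_id)
    return order

# Hypothetical toy collection and coding decisions.
collection = {
    "d1": "weekly cafeteria menu",
    "d2": "rebate paid to distributor outside contract",
    "d3": "distributor asked for a side letter on the rebate",
    "d4": "side letter signed, keep it out of the contract file",
    "d5": "holiday schedule reminder",
}
positives = {"d2", "d3", "d4"}

order = cal_review(collection, "rebate distributor", lambda d: d in positives)
print(order)
```

In this toy run, `d4` shares no terms with the starting seed but is still reviewed third, because the model picked up "side" and "letter" from the coding of `d3`. That is the continuous learning step at work.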

The full functionality of a CAL tool should be exploited to make every effort to locate responsive documents. One or more synthetic seeds, reflecting the content of a document that would be responsive, should be used to train the tool, along with the “close” documents. Contextually diverse documents should be included, to ensure a thorough exploration of the entire collection. Again, any documents that are contextually close to being responsive should be considered positive, in order to prioritize any truly responsive documents along the way.

Once the remaining documents have been reviewed, and no responsive documents have been found, the underlying statistics can be used to essentially prove a negative—that is, to establish the statistical maximum number of responsive documents in the collection. Proportionality considerations will then determine whether a full review is warranted, just to locate that small number of potentially responsive documents.
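One common way to put a number on that statistical maximum, when a sample contains no positives at all, is the exact binomial bound sketched below. This swaps in a zero-hit bound in place of the margin-of-error approach described earlier, and it assumes the reviewed documents behave like a simple random sample. Both are assumptions of this sketch rather than claims from the article, and the collection size used is hypothetical.

```python
import math

def max_prevalence_zero_found(sample_size, confidence):
    """Upper bound on prevalence when a random sample contains zero positives.

    Solves (1 - p) ** sample_size = 1 - confidence for p.
    """
    alpha = 1 - confidence
    return 1 - alpha ** (1 / sample_size)

def max_positive_documents(collection_size, sample_size, confidence):
    """Upper bound on the number of positives remaining in the collection."""
    bound = max_prevalence_zero_found(sample_size, confidence)
    return math.ceil(collection_size * bound)

# Hypothetical example: 5,000 documents reviewed, none responsive,
# in a collection of 500,000 documents.
upper = max_positive_documents(500_000, 5_000, 0.95)
print(upper)
```

For large samples the bound is close to the familiar "rule of three" (about 3 divided by the sample size at 95% confidence). The resulting count is the kind of figure that can be weighed against the cost of a full review in a proportionality argument.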

Given the effectiveness of a CAL review, this approach is actually more thorough than a traditional random statistical sample review, and is a reasonable way to demonstrate the absence of responsive documents in the collection without having to review the entire collection.

What’s Next

Next, we’ll explore how Predict is effectively used for protection tasks (such as privilege review) and other knowledge generation tasks (such as opposing party production reviews, depo prep and issue analysis).


1. In the simulation, we generated the synthetic seed by excerpting language from a few positive documents, and then compiling the excerpts in a separate, single document. In practice, a synthetic seed would typically be created without reference to known positive documents, as they would serve as seed documents in their own right.
2. This simulation, which actually included an evaluation of the effectiveness of 57 different starting seeds, is described in greater detail in an earlier blog post: 57 Ways to Leave Your (Linear) Lover.
4. The relationship between statistical parameters and the number of documents reviewed can easily be determined using readily available tools such as those found at