Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Comparing Active Learning to Random Sampling: Using Zipf’s Law to Evaluate Which is More Effective for TAR

Maura Grossman and Gordon Cormack just released another blockbuster article,  “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’” 7 Federal Courts Law Review 286 (2014). The article was in part a response to an earlier article in the same journal by Karl Schieneman and Thomas Gricks, in which they asserted that Rule 26(g) imposes “unique obligations” on parties using TAR for document productions and suggested using techniques we associate with TAR 1.0 including:

Training the TAR system using a random “seed” or “training” set as opposed to one relying on judgmental sampling, which “may not be representative of the entire population of electronic documents within a given collection.”

From the beginning, we have advocated a TAR 2.0 approach that uses judgmental seeds (selected by the trial team using all techniques at their disposal to find relevant documents). Random seeds are a convenient shortcut to approximating topical coverage, especially when one doesn’t have the algorithms and computing resources to model the entire document collection. But they are neither the best way to train a modern TAR system nor the only way to eliminate bias and ensure full topical coverage. We have published several research papers and articles showing that documents selected via continuous active learning and contextual diversity (active modeling of the entire document set) consistently beat training documents selected at random.

Who is George Zipf and what does he have to do with any of this? Read on.

In this latest article and in a recent peer-reviewed study (which we discussed in a recent blog post), Cormack and Grossman also make a compelling case that random sampling is one of the least effective methods for training. Indeed, they conclude that even the worst examples of keyword searches are likely to bring better training results than random selection, particularly for populations with low levels of richness.

Ralph Losey has also written on the issue recently, arguing that relying on random samples rather than judgmental samples “ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights.”

Our experience, like theirs, is that judgmental samples selected using attorneys’ knowledge of the case can get you started more effectively, and that any possible bias arising from the problem of “unknown unknowns” can be easily corrected with the proper tools. We also commonly see document collections with very low richness, which makes these points even more important in actual practice.

Herb Roitblat, the developer of OrcaTec (which apparently uses random sampling for training purposes), makes cogent arguments for the superiority of a random-only sampling approach. (See his posts here and here.) His main argument is that training using judgmental seeds backed by review team judgments leads to “bias” because “you don’t know what you don’t know.” Our experience, which is now backed by the peer-reviewed research of Cormack and Grossman, is that there are more effective ways to avoid bias than simple random sampling.

We certainly agree with Roitblat that there is always a concern for “bias” – at least in the sense of not knowing what you don’t know (rather than any potential “lawyer manipulation” that Ralph Losey properly criticizes in his recent post). But it isn’t necessarily a problem that prevents us from ever using judgmental seeds. Sometimes – depending on the skill and knowledge of the team and the nature of the relevant information in the matter itself – judgmental selection of training documents can indeed cover all relevant aspects of a matter. At other times, judgmental samples will miss some topics because of the problem of “unknown unknowns,” but this deficiency can be easily corrected by using an algorithm such as contextual diversity, which models the entire document population and actively identifies topics that need human attention, rather than blindly relying on random samples to hit those pockets of documents the attorneys missed.

The goal of this post, however, is not to dissect the arguments on either side of the random sampling debate. Rather, we want to have a bit of fun and show you how Zipf’s Law and the many ways it is manifest in document populations argue strongly for the form of active learning we use to combat the possibility of bias. Our method is called “contextual diversity” and Zipf’s law can help you understand why it is more efficient and effective than random sampling for ensuring topical coverage and avoiding bias.

What is Contextual Diversity?

A typical TAR 1.0 workflow often involves an expert reviewing a relatively small set of documents, feeding those documents into the TAR system to do its thing, and then having a review team check samples to confirm the machine’s performance. But in TAR 2.0, we continuously use all the judgments of the review teams to make the algorithm smarter (which means you find relevant documents faster). Like Cormack and Grossman, we feed documents ranked high for relevance to the review team and use their judgments to train the system. However, our continuous learning approach also throws other options into the mix to further improve performance, combat potential bias, and ensure complete topical coverage. One of these options that addresses all three concerns is our “contextual diversity” algorithm.

Contextual diversity refers to documents that are highly different from the ones already seen and judged by human reviewers (and thus, under a TAR 2.0 approach, already used in training), no matter how those documents were initially selected for review. Because our system ranks all of the documents in the collection on a continual basis, we know a lot about every document – not only those the review team has seen, but also (and more importantly) those it has not yet seen. The contextual diversity algorithm identifies documents based on how significant they are and how different they are from the ones already seen, and then selects the training documents that are most representative of those unseen topics for human review.

It’s important to note that we’re not solving the strong AI problem here – the algorithm doesn’t know what those topics mean or how to rank them.  But it can see that these topics need human judgments on them and then select the most representative documents it can find for the reviewers. This accomplishes two things: (1) it is constantly selecting training documents that will provide the algorithm with the most information possible from one attorney-document view, and (2) it is constantly putting the next biggest “unknown unknown” it can find in front of attorneys so they can judge for themselves whether it is relevant or important to their case.

We feed in enough of the contextual diversity documents to ensure that the review team gets a balanced view of the document population, regardless of how any initial seed documents were selected. But we also want the review team focused on highly relevant documents, not only because this is their ultimate goal, but also because these documents are highly effective at further training the TAR system as Cormack and Grossman now confirm. Therefore, we want to make the contextual diversity portion of the review as efficient as possible. How we optimize that mix is a trade secret, but the concepts behind contextual diversity and active modeling of the entire document population are explained below.

Contextual Diversity: Explicitly Modeling the Unknown


Contextual Diversity

In the above example, assume you started the training with contract documents found either through keyword search or witness interviews. You might see terms like the ones above the blue dotted line showing up in the documents. Documents 10 and 11 have human judgments on them (indicated in red and green), so the TAR system can assign weights to the contract terms (indicated in dark blue).

But what if there are other documents in the collection, like those shown below the dotted line, that have highly technical terms but few or none of the contract terms?  Maybe they just arrived in a rolling collection. Or maybe they were there all along but no one knew to look for them. How would you find them based on your initial terms? That’s the essence of the bias argument.

With contextual diversity, we analyze all of the documents. Again, we’re not solving the strong AI problem here, but the machine can still plainly see that there is a pocket of different, unjudged documents there. It can also see that one document in particular, 1781, is the most representative of all those documents, being at the center of the web of connections among the unjudged terms and unjudged documents. Our contextual diversity engine would therefore select that one for review, not only because it gives the best “bang for the buck” for a single human judgment, but also because it gives the attorneys the most representative and efficient look into that topic that the machine can find.
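Here is a highly simplified sketch of that selection step. This is not Catalyst’s actual (proprietary) algorithm – it is a toy illustration in which documents are term sets, “novelty” measures distance from everything already judged, and “centrality” measures how representative a document is of the other unjudged documents. Documents 10, 11, and 1781 echo the figure above; 1780 and 1782 are invented neighbors:

```python
def jaccard(a, b):
    """Similarity between two documents represented as term sets."""
    return len(a & b) / len(a | b)

def contextual_diversity_pick(docs, judged_ids):
    """Toy sketch (not the real algorithm): pick the unjudged document
    that is both far from everything judged and central to its pocket.

    docs: {doc_id: set of terms}; judged_ids: ids already reviewed.
    """
    judged = [docs[i] for i in judged_ids]
    unjudged = {i: t for i, t in docs.items() if i not in judged_ids}
    # Novelty: how dissimilar a document is from everything already judged.
    novelty = {i: 1 - max((jaccard(t, j) for j in judged), default=0)
               for i, t in unjudged.items()}
    # Centrality: how representative it is of the other unjudged documents.
    centrality = {i: sum(jaccard(t, u) for k, u in unjudged.items() if k != i)
                  for i, t in unjudged.items()}
    return max(unjudged, key=lambda i: novelty[i] * centrality[i])

docs = {
    10:   {"contract", "indemnify", "warranty"},    # judged (contract pocket)
    11:   {"contract", "warranty", "termination"},  # judged
    1780: {"voltage", "firmware"},                  # hypothetical neighbor
    1781: {"voltage", "firmware", "schematic"},     # hub of the unseen pocket
    1782: {"firmware", "schematic"},                # hypothetical neighbor
}
print(contextual_diversity_pick(docs, judged_ids={10, 11}))  # -> 1781
```

The toy version picks 1781 for the same reason described above: it shares no terms with the judged contract documents, yet sits at the center of the unjudged technical pocket.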

So Who is This Fellow Named Zipf?

Zipf’s law was named after the famed American linguist George Kingsley Zipf, who died in 1950. The law refers to the fact that many types of data, including city populations and a host of other things studied in the physical and social sciences, seem to follow a Zipfian distribution, which is part of a larger family of power law probability distributions. (You can read all about Zipf’s Law in Wikipedia, where we pulled this description.)

Why does this matter? Bear with us, you will see the fun in this in just a minute.

It turns out that the frequency of words and many other features in a body of text tend to follow a Zipfian power law distribution. For example, you can expect the most frequent word in a large population to be twice as frequent as the second most common word, three times as frequent as the third most common word and so on down the line. Studies of Wikipedia itself have found that the most common word, “the,” is twice as frequent as the next, “of,” with the third most frequent word being “and.” You can see how the frequency drops here:

Frequency Drop
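The idealized rank-frequency relationship is simple enough to compute directly. A small sketch (the 60,000 figure is ours, chosen only for illustration):

```python
def zipf_frequencies(f_top, ranks):
    """Expected count of the rank-r word under Zipf's law (f proportional
    to 1/r), given that the top-ranked word appears f_top times."""
    return [round(f_top / r) for r in ranks]

# If "the" appeared 60,000 times in a corpus, Zipf's law predicts
# roughly 30,000 occurrences for "of" (rank 2) and 20,000 for rank 3.
print(zipf_frequencies(60_000, [1, 2, 3, 4, 5]))  # -> [60000, 30000, 20000, 15000, 12000]
```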

Topical Coverage and Zipf’s Law

Here’s something that may sound familiar. Ever seen a document population where documents about one topic were pretty common, and then those about another topic were somewhat less common, and so forth down to a bunch of small, random stuff? We can model the distribution of subtopics in a document collection using Zipf’s law too. And doing so makes it easier to see why active modeling and contextual diversity are both more efficient and more effective than random sampling.

Here is a model of our document collection, broken out by subtopics. The subtopics are shown as bubbles, scaled so that their areas follow a Zipfian distribution. The biggest bubble represents the most prevalent subtopic, while the smaller bubbles reflect increasingly less frequent subtopics in the documents.


Now to be nitpicky, this is an oversimplification. Subtopics are not always discrete, boundaries are not precise, and the modeling is much too complex to show accurately in two dimensions. But this approximation makes it easier to see the main points.

So let’s start by taking a random sample across the documents, both to start training a TAR engine and also to see what stories the collection can tell us:


We’ll assume that the documents are distributed randomly in this population, so we can draw a grid across the model to represent a simple random sample.  The red dots reflect each of 80 sample documents. The portion of the grid outside the circle is ignored.

We can now represent our topical coverage by shading the circles covered by the random sample.


You can see that a number of the randomly sampled documents hit the same topical circles. In fact, over a third (32 out of 80) fall in the largest subtopic.  A full dozen are in the next largest. Others hit some of the smaller circles, which is a good thing, and we can see that we’ve colored a good proportion of our model yellow with this sample.

So in this case, a random sample gives fairly decent results without having to do any analysis or modeling of the entire document population. But it’s not great. And with respect to topical coverage, it’s not exactly unbiased, either. The biggest topics have a ton of representation, a few tiny ones are now represented by a full 1/80 of the sample, and many larger ones were completely missed. So a random sample has some built-in topical bias that varies randomly – a different random sample might have biases in different directions. Sure, it gives you some rough statistics on what is more or less common in the collection, but both attorneys and TAR engines usually care more about what is in the collection rather than how frequently it appears.
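A quick simulation shows the same pattern. The setup is our own toy model, not from any study: 40 subtopics with Zipf-distributed sizes, sampled uniformly at random 80 times:

```python
import random

# Toy model (our assumption): 40 subtopics with Zipf-distributed sizes,
# a collection built from them, and a uniform random sample of 80 docs.
random.seed(7)  # fixed seed so the sketch is repeatable
n_topics, sample_size = 40, 80
sizes = [round(10_000 / r) for r in range(1, n_topics + 1)]  # Zipfian sizes
collection = [topic for topic, size in enumerate(sizes) for _ in range(size)]

sample = random.sample(collection, sample_size)
covered = set(sample)
print(f"distinct subtopics hit: {len(covered)} of {n_topics}")
print(f"draws landing in the largest subtopic: {sample.count(0)}")
```

With Zipf-sized subtopics, a large share of the 80 draws piles into the few biggest topics while many mid-sized topics go unseen – exactly the built-in topical bias described above.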

So what if we actually can perform analysis and modeling of the entire document population? Can we do better than a random sample? Yes, as it turns out, and by quite a bit.

Let’s attack the problem again by putting attorney eyes on 80 documents – the exact same effort as before – but this time we select the sample documents using a contextual diversity process. Remember: our mission is to find representative documents from as many topical groupings as possible to train the TAR engine most effectively, to avoid any bias that might arise from judgmental sampling, and to help the attorneys quickly learn everything they need to from the collection. Here is the topical coverage achieved using contextual diversity for the same size review set of 80 documents:


Now look at how much of that collection is colored yellow. By actively modeling the whole collection, the TAR engine with contextual diversity uses everything it can see in the collection to give reviewing attorneys the most representative document it can find from each subtopic. By using its knowledge of the documents to systematically work through the subtopics, it avoids massively oversampling the larger ones and relying on random samples to eventually hit all the smaller ones (which, given the nature of random samples, need to be very large to have a decent chance of hitting all the small stuff). It achieves much broader coverage for the exact same effort.
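For contrast, here is a sketch (again a toy model of our own, with the same Zipf-sized subtopics) of what an engine that actively models the collection can do with the same 80-document budget: spend each review slot on a representative of the largest still-uncovered subtopic, instead of redrawing the big ones:

```python
# Toy sketch: greedy coverage of Zipf-sized subtopics, one representative
# document per subtopic, largest uncovered subtopic first.
def contextual_diversity_coverage(sizes, budget):
    covered = []
    for _ in range(budget):
        remaining = [t for t in range(len(sizes)) if t not in covered]
        if not remaining:
            break  # every subtopic already has a representative document
        covered.append(max(remaining, key=lambda t: sizes[t]))
    return covered

sizes = [round(10_000 / r) for r in range(1, 41)]  # 40 Zipf-sized subtopics
picks = contextual_diversity_coverage(sizes, budget=80)
print(f"subtopics covered: {len(picks)} of {len(sizes)}")  # all 40, with budget to spare
```

Under this toy model, the active approach covers every subtopic well within the 80-document budget, while the uniform random sample typically oversamples the largest topics and misses a number of the mid-sized ones.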


Below is a comparison of the two different approaches to selecting a sample of 80 documents. The subtopics colored yellow were covered by both. Orange indicates those that were found using contextual diversity but missed by the random sample of the same size. Dark blue shows those smaller topics that the random sample hit but contextual diversity did not reach in the first 80 seed documents.


Finally, here is a side by side comparison of the topical coverage achieved for the same amount of review effort:


Now imagine that the attorneys started with some judgmental seeds taken from one or two topics. You can also see how contextual diversity would help balance the training set and keep the TAR engine from running too far down only one or two paths at the beginning of the review by methodically giving attorneys new, alternative topics to evaluate.

When subtopics roughly follow a Zipfian distribution, we can easily see how simple random sampling tends to produce inferior results compared to an active learning approach like contextual diversity. (In fact, systematic modeling of the collection and algorithmic selection of training documents beats random sampling even if every topic were the exact same size, but for other reasons we will have to discuss in a separate post.) For tasks such as a review for production where the recall and precision standards are based on “reasonableness” and “proportionality,” random sampling – while not optimal – may be good enough. But if you’re looking for a needle in a haystack or trying to make sure that the attorneys’ knowledge about the collection is complete, random sampling quickly falls farther and farther behind active modeling approaches.

So while we strongly agree with the findings of Cormack and Grossman and their conclusions regarding active learning, we also know through our own research that the addition of contextual diversity to the mix makes the results even more efficient.

After all, the goal here is to find relevant documents as quickly and efficiently as possible while also quickly helping attorneys learn everything they need to know to litigate the case effectively. George Zipf is in our corner.

About John Tredennick
A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.

Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on using computers in litigation. At the same time, he wrote How to Prepare for Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer’s Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer’s Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in their special issue on the “Top Fifty Big Law Innovators” of the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London’s CityTech magazine named him one of the “Top 100 Global Technology Leaders.” In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA’s Law Practice Management Section. For many years, he was editor-in-chief of the ABA’s Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on the subject to audiences on four of the five continents.
In his spare time, you will find him competing on the national equestrian show jumping circuit.

About Mark M. Noel, J.D.
Mark Noel is a managing director of professional services at Catalyst Repository Systems, where he specializes in helping clients use technology-assisted review, advanced analytics, and custom workflows to handle complex and large-scale litigations. Before joining Catalyst, Mark was a member of the Acuity team at FTI Consulting, co-founded an e-discovery software startup, and was an intellectual property litigator with Latham & Watkins LLP.


Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review

This past weekend I received an advance copy of a new research paper prepared by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” They have posted an author’s copy here.

The study attempted to answer one of the more important questions surrounding TAR methodology:

Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?

Their conclusion was unequivocal:

The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P < 0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents.

Among passive-learning methods, significantly less human review effort (P < 0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of “stabilization” – determining when training is adequate, and therefore may stop.

The seminal paper is slated to be presented in July in Australia at the annual conference of the Special Interest Group on Information Retrieval (SIGIR), a part of the Association for Computing Machinery (ACM).

Why is This Important?

Their research replicates the findings of much of our own research and validates many of the points about TAR 2.0 which we have made in recent posts here and in an article published in Law Technology News. (Links to the LTN article and our prior posts are collected at the bottom of this post.)

Specifically, Cormack and Grossman conclude from their research that:

  1. Continuous active learning is more effective than the passive learning (one-time training) used by most TAR 1.0 systems.
  2. Judgmental seeds and review using relevance feedback are more effective than random seeds, particularly for sparse collections.
  3. Subject matter experts aren’t necessary for training; review teams and relevance feedback are just as effective for training.

Their findings open the door to a more fluid approach to TAR, one we have advocated and used for many years. Rather than have subject matter experts click endlessly through randomly selected documents, let them find as many good judgmental seeds as possible. The review team can get going right away and the team’s judgments can be continuously fed back into the system for even better ranking. Experts can QC outlying review judgments to ensure that the process is as effective as possible.

While I will summarize the paper, I urge you to read it for yourself. At eight pages, this is one of the easier to read academic papers I have run across. Cormack and Grossman write in clear language and their points are easy to follow (for us non-Ph.D.s). That isn’t always true of other SIGIR/academic papers.

The Research

Cormack and Grossman chose eight different review projects for their research. Four came from the 2009 TREC Legal Track Interactive Task program. Four others came from actual reviews conducted in the course of legal proceedings.

The review projects under study ranged from a low of 293,000 documents to a high of just over 1.1 million. Prevalence (richness) was generally low, which is often the case in legal reviews, ranging from 0.25% to 3.92% with a mean of 1.18%.

The goal here was to compare the effectiveness of three TAR protocols:

  1. SPL: Simple Passive Learning.
  2. SAL: Simple Active Learning.
  3. CAL: Continuous Active Learning (with Relevance Feedback).

The first two protocols are typical of TAR 1.0 training. Simple Passive Learning uses randomly selected documents for training. Simple Active Learning uses judgmental seeds for the first round of training but then uses computer-selected documents (uncertainty sampling) to further improve the classifier.

Continuous Active Learning also starts with judgmental seeds (like SAL) but then trains using review teams working primarily with highly relevant documents after the first ranking. Catalyst uses a CAL-like approach in Predict, but we further supplement the relevance feedback with a balanced, dynamically selected mixture that includes both relevance feedback and additional documents selected using Predict’s contextual diversity engine.
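To make the protocol concrete, here is a minimal sketch of a CAL-style loop, with a trivial term-overlap scorer standing in for a real machine-learning classifier. The document contents and the relevance rule are invented purely for illustration:

```python
# Minimal CAL sketch: start from judgmental seeds, then repeatedly
# re-rank the unreviewed documents and review the top batch, feeding
# each human judgment back into the "model" (here, a set of terms
# collected from documents judged relevant).

def cal_review(docs, seed_ids, label, batch_size=2, max_rounds=10):
    """docs: {id: set of terms}; label(id) -> True/False is the reviewer."""
    reviewed, relevant_terms, found = set(), set(), []
    queue = list(seed_ids)  # round 0: judgmental/keyword seeds
    for _ in range(max_rounds):
        for doc_id in queue:
            reviewed.add(doc_id)
            if label(doc_id):                    # human relevance judgment
                found.append(doc_id)
                relevant_terms |= docs[doc_id]   # feed the judgment back in
        unreviewed = [i for i in docs if i not in reviewed]
        if not unreviewed:
            break
        # Re-rank: next batch = highest-scoring unreviewed documents.
        unreviewed.sort(key=lambda i: len(docs[i] & relevant_terms), reverse=True)
        queue = unreviewed[:batch_size]
    return found

# Toy collection; "relevance" here just means the document mentions fraud.
docs = {
    1: {"fraud", "invoice"},
    2: {"invoice", "fraud", "wire"},
    3: {"lunch"},
    4: {"wire", "fraud"},
    5: {"picnic"},
}
found = cal_review(docs, seed_ids=[1], label=lambda i: "fraud" in docs[i])
print(found)  # -> [1, 2, 4]
```

Note how the loop surfaces the remaining relevant documents (2 and 4) before the irrelevant ones ever reach a reviewer, which is the essence of relevance feedback.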

As the authors explain:

The underlying objective of CAL is to find and review as many of the responsive documents as possible, as quickly as possible. The underlying objective of SAL, on the other hand, is to induce the best classifier possible, considering the level of training effort.

For each of the eight review projects, Cormack and Grossman ran simulated reviews using each of the three protocols. They used review judgments already issued for each project as “ground truth.” They then simulated running training and review in 1,000 seed increments. In a couple of cases they ran their experiments using 100 seed batches but this proved impractical for the entire project.

(As a side note, we have done experiments in which the size of the batch is varied. Generally, the faster and tighter the iteration, the higher the recall for the exact same amount of human effort. Rather than delve further into this here, this topic deserves and will shortly receive its own separate blog post.)

The Results

Here are the key conclusions Cormack and Grossman reached:

The results show SPL to be the least effective TAR method, calling into question not only its utility, but also commonly held beliefs about TAR. The results also show that SAL, while substantially more effective than SPL, is generally less effective than CAL, and as effective as CAL only in a best-case scenario that is unlikely to be achieved in practice.

Our primary implementation of SPL, in which all training documents were randomly selected, yielded dramatically inferior results to our primary implementations of CAL and SAL, in which none of the training documents were randomly selected.

In summary, the use of a seed set selected using a simple keyword search, composed prior to the review, contributes to the effectiveness of all of the TAR protocols investigated in this study.

Perhaps more surprising is the fact that a simple keyword search, composed without prior knowledge of the collection, almost always yields a more effective seed set than random selection, whether for CAL, SAL, or SPL. Even when keyword search is used to select all training documents, the result is generally superior to that achieved when random selection is used. That said, even if passive learning is enhanced using a keyword-selected seed or training set, it is still dramatically inferior to active learning.

While active-learning protocols employing uncertainty sampling are clearly more effective than passive-learning protocols, they tend to focus the reviewer’s attention on marginal rather than legally significant documents. In addition, uncertainty sampling shares a fundamental weakness with passive learning: the need to define and detect when stabilization has occurred, so as to know when to stop training. In the legal context, this decision is fraught with risk, as premature stabilization could result in insufficient recall and undermine an attorney’s certification of having conducted a reasonable search under (U.S.) Federal Rule of Civil Procedure 26(g)(1)(B).

Their article includes several Yield/Gain charts illustrating their findings. I won’t repost them all here, but here is their first chart as an example. It shows comparative results for the three protocols for TREC Topic 201. You can easily see that Continuous Active Learning resulted in a higher level of recall after review of fewer documents, which is the key to keeping review costs in check:


No doubt some people will challenge their conclusions, but they cannot be ignored as we move from TAR 1.0 to the next generation.


As the authors point out:

This study highlights an alternative approach – continuous active learning with relevance feedback – that demonstrates superior performance, while avoiding certain problems associated with uncertainty sampling and passive learning. CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.

From the beginning, we argued that continuous ranking/continuous learning is more effective than the TAR 1.0 approach of a one-time cutoff. We have also argued that clicking through thousands of randomly selected seeds is less effective for training than actively finding relevant documents and using them instead. And, lastly, we have issued our own research suggesting strongly that subject matter experts are not necessary for TAR training and can be put to better use finding useful documents for training and doing QC of outlier review team judgments, continuously and on the fly, with the ability to always determine where the outlier pool is shifting as review continues.

It is nice to see that others agree and are providing even more research to back up these important points. TAR 2.0 is here to stay.

Further reading on Catalyst’s research and findings about TAR 2.0:

How Many Documents in a Gigabyte? A Quick Revisit to this Interesting Subject

I read with great interest a recent article in Law Technology News, “Four Examples of Predictive Coding Success,” by Barclay T. Blair.

The purpose of the article was to report on several successful uses of technology-assisted review. While that was interesting, my attention was drawn to another aspect of the report. Three of the case studies provided data shedding further light on that persistent e-discovery mystery: “How many documents in a gigabyte?”

Readers may remember that I wrote about this in two earlier posts.

The issue is important to e-discovery professionals because it can impact review estimates. If the number is 15,000 docs in a gigabyte, as some have suggested in the past, then a one terabyte collection could contain 1.5 million documents for consideration. Conversely, if the number is lower, say 5,000 docs in a gigabyte, as the “conservative” wing has suggested, then the number for review consideration drops to 500,000.

Our earlier posts raised eyebrows because our studies suggested the number was far less than the 10-15,000 put forth by many others. Indeed, we suggested the number was even lower than 5,000 in many cases. Our average was more like 3,000 docs in a gigabyte, although we carefully qualified that as being heavily dependent on the type of files being considered. Spreadsheets can go well below 3,000 docs to a gigabyte. Text files and HTML emails can go higher, sometimes by a lot. It was our weighted average, measured over about 100 million documents, that led to our conclusion.

So, the article interested me because in three of the four reports, the author included both the gigabyte size and the associated number of documents. I wanted to see how those numbers compared to our research. As you will see, the results were interesting.

Here are excerpts from the three case studies and my comments:

Case 1: The Document Dump

Opposing counsel produced more than 800,000 documents (267GB) to D4’s client. The client suspected that this was a “document dump,” i.e., a production purposely designed to make it difficult to find relevant information.

I get 2,996 documents per gigabyte for this example (800,000 ÷ 267).

Case 2: A Merger At Risk-When Speed Matters

A Fortune 500 multinational company was merging when the U.S. Department of Justice notified them of a Second Request under the Hart-Scott-Rodino Antitrust Improvements Act. In scope were 5 million documents (1.6 terabytes), some in foreign languages.

I get 3,125 documents per gigabyte in this case (5,000,000 ÷ 1,600).

Case 3: High Cost/Low Merits

A major healthcare company operating in 100+ countries faced a contract dispute. The general counsel of litigation characterized the litigation as “high cost/low merits” because discovery costs were liable to be disproportionately high relative to the legal risks. The collection contained 238 GB of electronically stored information (720,000 documents) from various sources.

I get 3,025 documents per gigabyte in this case (720,000 ÷ 238).
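The arithmetic behind these three figures is simple enough to check in a few lines. The document and gigabyte counts below are the ones reported in the case studies:

```python
# Docs-per-gigabyte for the three case studies (figures from the article).
cases = {
    "Document Dump": (800_000, 267),        # documents, gigabytes
    "Second Request": (5_000_000, 1_600),
    "High Cost/Low Merits": (720_000, 238),
}

for name, (docs, gb) in cases.items():
    print(f"{name}: {docs / gb:,.0f} docs/GB")
# All three land within a few percent of 3,000 docs/GB.
```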

What does this mean?

I don’t want to overplay this data but I sure found it interesting that the number of documents in a gigabyte was consistent across these three case studies. It was also fun to see that the figures came out pretty close to the numbers we derived in our studies. A lot of people challenged my original conclusion because it strayed pretty far from conventional wisdom. I confess to having doubts myself but the data told me otherwise.

So, how many documents in a gigabyte? I suspect the number is much closer to 3,000 than 10,000. What have you seen in your documents?

Using TAR in International Litigation: Does Predictive Coding Work for Non-English Languages?

[This article originally appeared in the Winter 2014 issue of EDDE Journal, a publication of the E-Discovery and Digital Evidence Committee of the ABA Section of Science and Technology Law.]

Although still relatively new, technology-assisted review (TAR) has become a game changer for electronic discovery. This is no surprise. With digital content exploding at unimagined rates, the cost of review has skyrocketed, now accounting for over 70% of discovery costs. In this environment, a process that promises to cut review costs is sure to draw interest, as TAR, indeed, has.

Called by various names—including predictive coding, predictive ranking, and computer-assisted review—TAR has become a central consideration for clients facing large-scale document review. It originally gained favor for use in pre-production reviews, providing a statistical basis to cut review time by half or more. It gained further momentum in 2012, when federal and state courts first recognized the legal validity of the process.

More recently, lawyers have realized that TAR also has great value for purposes other than preparing a production. For one, it can help you quickly find the most relevant documents in productions you receive from an opposing party. TAR can be useful for early case assessment, for regulatory investigations and even in situations where your goal is only to speed up the production process through prioritized review. In each case, TAR has proven to save time and money, often in substantial amounts.

But what about non-English language documents? For TAR to be useful in international litigation, it needs to work for languages other than English. Although English is used widely around the world,[1] it is not the only language you will see if you get involved in multi-national litigation, arbitration or regulatory investigations. Chinese, Japanese and Korean will be common for Asian transactions; German, French, Spanish, Russian, Arabic and Hebrew will be found for matters involving European or Middle Eastern nations. Will TAR work for documents in these languages?

Many industry professionals doubted that TAR would work on non-English documents. They reasoned that the TAR process was about “understanding” the meaning of documents. It followed that unless the system could understand the documents—and presumably computers understand only English—the process wouldn’t be effective.

The doubters were wrong. Computers don’t actually understand documents; they simply catalog the words in documents. More accurately, we call what they recognize “tokens,” because often the fragments (numbers, misspellings, acronyms and simple gibberish) are not even words. The question, then, is whether computers can recognize tokens (words or otherwise) when they appear in other languages.

The simple answer is yes. If the documents are processed properly, TAR can be just as effective for non-English as it is for English documents. After a brief introduction to TAR and how it works, I will show you how this can be the case. We will close with a case study using TAR for Japanese documents.

What is TAR?

TAR is a process through which one or more humans interact with a computer to train it to find relevant documents. Just as there are many names for the process, there are many variations of it. For simplicity’s sake, I will use Magistrate Judge Andrew J. Peck’s definition in Da Silva Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012), the first case to approve TAR as a method to shape document review:

By computer assisted review, I mean tools that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with a human reviewer.

It is about as simple as that:

  1. A human (subject matter expert, often a lawyer) sits down at a computer and looks at a subset of documents.
  2. For each, the lawyer records a thumbs-up or thumbs-down decision (tagging the document). The TAR algorithm watches carefully, learning during this training.
  3. When the training session is complete, we let the system rank and divide the full set of documents between (predicted) relevant and irrelevant.[2]
  4. We then review the relevant documents, ignoring the rest.
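The four steps above can be sketched in code. This is an illustrative toy, not any vendor's actual algorithm—real TAR engines use far more sophisticated statistical models—but the shape of the workflow (train on tagged examples, then rank the rest) is the same:

```python
from collections import Counter

def train(labeled):
    """labeled: list of (text, is_relevant). Returns per-token weights."""
    pos, neg = Counter(), Counter()
    for text, relevant in labeled:
        (pos if relevant else neg).update(text.lower().split())
    # Weight = how much more often a token appears in relevant training docs.
    return {t: pos[t] - neg[t] for t in set(pos) | set(neg)}

def rank(weights, docs):
    """Score each unreviewed doc and sort most-likely-relevant first."""
    score = lambda text: sum(weights.get(t, 0) for t in text.lower().split())
    return sorted(docs, key=score, reverse=True)

# Step 1-2: a reviewer tags a handful of training documents.
training = [
    ("merger price antitrust", True),    # thumbs-up
    ("lunch menu cafeteria", False),     # thumbs-down
]
w = train(training)

# Step 3-4: rank the full set; review from the top down.
ordered = rank(w, ["cafeteria lunch special", "antitrust merger filing"])
print(ordered[0])  # the antitrust document ranks first
```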

The benefits from this process are easy to see. Let’s say you started with a million documents that otherwise would have to be reviewed by your team. If the computer algorithm predicted with the requisite degree of confidence that 700,000 are likely not-relevant, you could then exclude them from the review for a huge savings in review costs. That is a great result, particularly if you are the one paying the bills. At four dollars a document for review (to pick a figure), you just saved $2.8 million. And the courts say this is permissible.

How is TAR Used?

TAR can be used for several purposes. The classic use is to prioritize the review process, typically in anticipation of an outgoing production. You use TAR to sort the documents in order of likely relevance. The reviewers do their work in that order, presumably reviewing the most likely relevant ones first. When they get to a point where the number of relevant documents drops significantly, suggesting that they have seen most of them, the review stops. Somebody then samples the unreviewed documents to confirm that the number of relevant documents remaining is sufficiently low to justify discontinuing further, often expensive, review.

We can see the benefits of a TAR process through the following chart, which is known as a yield curve:


A yield curve presents the results of a ranking process and is a handy way to visualize the difference between two processes. The X axis shows the percentage of documents that are available for review. The Y axis shows the percentage of relevant documents found at each point in the review.

As a baseline, I created a gray diagonal line to show the progress of a linear review (which essentially moves through the documents in random order). Without a better means for ordering the documents by relevance, the recall rate for a linear review typically matches the percentage of documents actually reviewed, hence the straight line. By the time you have seen 80% of the documents, you probably have seen 80% of the relevant documents.

The blue line shows the progress of a TAR review. Because the documents are ranked in order of likely relevance, you see more relevant documents at the front end of your review. Following the blue line up the Y axis, you can see that you would reach 50% recall (have viewed 50% of the relevant documents) after about 5% of your review. You would have seen 80% of the relevant documents after reviewing just 10% of the total review population.

This is a big deal. If you use TAR to organize your review, you can dramatically improve the speed at which you find relevant documents over a linear review process. Assuming the judge will let you stop your review after you find 80% of the documents (and some courts have indicated this is a valid stopping point), review savings can be substantial.
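The gap between the two curves can be made concrete with a small function. Assume we know, in ranked order, which documents turned out to be relevant (1) or not (0); the labels below are invented purely for illustration:

```python
def recall_at_depth(ranked_labels, depth):
    """Fraction of all relevant documents found after reviewing the top
    `depth` fraction of a ranked list (1 = relevant, 0 = not)."""
    cutoff = int(len(ranked_labels) * depth)
    return sum(ranked_labels[:cutoff]) / sum(ranked_labels)

# A TAR-style ranking front-loads the relevant documents...
ranked = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
print(recall_at_depth(ranked, 0.5))  # 0.8 -- 80% recall at half the review
# ...while a linear (unranked) review finds them only in proportion
# to the fraction of documents reviewed.
```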

You can also use this process for other purposes. Analyzing inbound productions is one good example. These are often received shortly before depositions begin. If you receive a million or so documents in a production, how are you to quickly find which ones are important and which are not?

Here is an example where counsel reviewed about 200,000 documents received not long before depositions commenced and found about 5,700 which were “hot.” Using a small set of their own judgments about the documents for training, we were able to demonstrate that they would have found the same number of hot documents after reviewing only 38,000 documents. They could have stopped there and avoided the cost of reviewing the remaining 162,000 documents.


You can also use this process for early case assessment, using the ranking engine to place a higher number of relevant documents at the front of the stack.

What about non-English Documents?

To understand why TAR can work with non-English documents, you need to know two basic points:

  1. TAR doesn’t understand English or any other language. It uses an algorithm to associate words with relevant or irrelevant documents.
  2. To use the process for non-English documents, particularly those in Chinese and Japanese, the system has to first tokenize the document text so it can identify individual words.

We will hit these topics in order.

1. TAR Doesn’t Understand English

It is beyond the province of this article to provide a detailed explanation of how TAR works, but a basic explanation will suffice for our purposes. Let me start with this: TAR doesn’t understand English or the actual meaning of documents. Rather, it simply analyzes words algorithmically according to their frequency in relevant documents compared to their frequency in irrelevant documents.

Think of it this way. We train the system by marking documents as relevant or irrelevant. When I mark a document relevant, the computer algorithm analyzes the words in that document and ranks them based on frequency, proximity or some other such basis. When I mark a document irrelevant, the algorithm does the same, this time giving the words a negative score. At the end of the training process, the computer sums up the analysis from the individual training documents and uses that information to build a search against a larger set of documents.

While different algorithms work differently, think of the TAR system as creating huge searches using the words developed during training. It might use 10,000 positive terms, with each ranked for importance. It might similarly use 10,000 negative terms, with each ranked in a similar way. The search results would come up in an ordered fashion sorted by importance, with the most likely relevant ones coming first.

None of this requires that the computer know English or the meaning of the documents or even the words in them. All the computer needs to know is which words are contained in which documents.

2. If Documents are Properly Tokenized, the TAR Process Will Work.

Tokenization may be an unfamiliar term to you but it is not difficult to understand. When a computer processes documents for search, it pulls out all of the words and places them in a combined index. When you run a search, the computer doesn’t go through all of your documents one by one. Rather, it goes to an ordered index of terms to find out which documents contain which terms. That’s why search works so quickly. Even Google works this way, using huge indexes of words.

As I mentioned, however, the computer doesn’t understand words, or even that a word is a word. Rather, for English documents it identifies a word as a series of characters separated by spaces or punctuation marks. Thus, it recognizes the words in this sentence because each has a space (or a comma) before and after it. Because not every group of characters is necessarily an actual “word,” information retrieval scientists call these groupings “tokens,” and the act of identifying them for the index “tokenization.”

All of these are tokens:

  • Bank
  • door
  • 12345
  • barnyard
  • mixxpelling

And so on. All of these will be kept in a token index for fast search and retrieval.

Certain languages, such as Chinese and Japanese, don’t delineate words with spaces or western punctuation. Rather, their characters run through the line break, often with no breaks at all. It is up to the reader to tokenize the sentences in order to understand their meaning.
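A rough illustration of the difference: for Western text, a simple pattern is enough to pull out tokens, but the same whitespace-based approach fails outright on Japanese. (The snippet below is illustrative only; production tokenizers, including segmenters for Asian languages, are considerably more sophisticated.)

```python
import re

def tokenize_western(text):
    # Tokens are runs of letters/digits delimited by spaces or punctuation.
    return re.findall(r"\w+", text.lower())

print(tokenize_western("Bank door, 12345 mixxpelling!"))
# ['bank', 'door', '12345', 'mixxpelling']

# Japanese runs characters together with no spaces, so whitespace splitting
# yields one giant, unsearchable "token" -- segmentation must come first.
japanese = "契約書を確認してください"   # "Please review the contract"
print(japanese.split())  # ['契約書を確認してください']
```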

Many early English-language search systems couldn’t tokenize Asian text, resulting in search results that often were less than desirable. More advanced search systems, like the one we chose for Catalyst, had special tokenization engines which were designed to index these Asian languages and many others that don’t follow the Western conventions. They provided more accurate search results than did their less-advanced counterparts.

Similarly, the first TAR systems were focused on English-language documents and could not process Asian text. At Catalyst, we added a text tokenizer to make sure that we handled these languages properly. As a result, our TAR system can analyze Chinese and Japanese documents just as if they were in English. Word frequency counts are just as effective for these documents and the resulting rankings are as effective as well.

A Case Study to Prove the Point.

Let me illustrate this with an example from a matter we handled not long ago. We were contacted by a major U.S. law firm that was facing review of a set of mixed Japanese and English language documents. It wanted to use TAR on the Japanese documents, with the goal of cutting both the cost and time of the review, but was uncertain whether TAR would work with Japanese.

Our solution to this problem was to first tokenize the Japanese documents before beginning the TAR process. Our method of tokenization—also called segmentation—extracts the Japanese text and then uses language-identification software to break it into words and phrases that the TAR engine can identify.

To achieve this, we loaded the Japanese documents into our review platform. As we loaded the documents, we performed language detection and extracted the Japanese text. Then, using our proprietary technology and methods, we tokenized the text so the system would be able to analyze the Japanese words and phrases.

With tokenization complete, we could begin the TAR process. In this case, senior lawyers from the firm reviewed 500 documents to create a reference set to be used by the system for its analysis. Next, they reviewed a sample set of 600 documents, marking them relevant or non-relevant. These documents were then used to train the system so it could distinguish between likely relevant and likely non-relevant documents and use that information for ranking.

After the initial review, and based on the training set, we directed the system to rank the remainder of the documents for relevance. The results were compelling:

  • The system was able to identify a high percentage of likely relevant documents (98%) and place them at the front of the review queue through its ranking process. As a result, the review team would need to review only about half of the total document population (48%) to cover the bulk of the likely relevant documents.
  • The remaining portion of the documents (52%) contained a small percentage of likely relevant documents. The review team reviewed a random sample from this portion and found only 3% were likely relevant. This low percentage suggested that these documents did not need to be reviewed, thus saving the cost of reviewing over half the documents.

By applying tokenization before beginning the TAR process, the law firm was able to target its review toward the most-likely relevant documents and to reduce the total number of documents that needed to be reviewed or translated by more than half.


As corporations grow increasingly global, legal matters are increasingly likely to involve non-English language documents. Many believed that TAR was not up to the task of analyzing non-English documents. The truth, however, is that with the proper technology and expertise, TAR can be used with any language, even difficult Asian languages such as Chinese and Japanese.

Whether for English or non-English documents, the benefits of TAR are the same. By using computer algorithms to rank documents by relevance, lawyers can review the most important documents first, review far fewer documents overall, and ultimately cut both the cost and time of review. In the end, that is something their clients will understand, no matter what language they speak.


[1] It is, for example, the language used in almost every commercial deal involving more than one country.

[2] Relevant in this case means relevant to the issues under review. TAR systems are often used to find responsive documents but they can be used for other inquiries such as privileged, hot or relevant to a particular issue.

Predictive Ranking (TAR) for Smart People

Predictive Ranking, aka predictive coding or technology-assisted review, has revolutionized electronic discovery–at least in mindshare if not actual use. It now dominates the dais for discovery programs, and has since 2012 when the first judicial decisions approving the process came out. Its promise of dramatically reduced review costs is top of mind today for general counsel. For review companies, the worry is about declining business once these concepts really take hold.

While there are several “Predictive Coding for Dummies” books on the market, I still see a lot of confusion among my colleagues about how this process works. To be sure, the mathematics are complicated, but the techniques and workflow are not that difficult to understand. I write this article with the hope of clarifying some of the more basic questions about TAR methodologies.

I spent over 20 years as a trial lawyer and partner at a national law firm and another 15 at Catalyst. During that time, I met a lot of smart people–but few actual dummies. This article is for smart lawyers and legal professionals who want to learn more about TAR. Of course, you dummies are welcome to read it too.

What is Predictive Ranking?

Predictive Ranking is our name for an interactive process whereby humans train a computer algorithm to identify useful (relevant) documents. We call it Predictive Ranking because the goal of these systems is to rank documents in order of estimated relevance. Humans do the actual coding.

How does it work?

In its simplest form, it works like the Pandora Internet radio service. Pandora has thousands of songs in its archive but no idea what kind of music you like. Its goal is to play music from your favorite artists but also to present new songs you might like as well.


How does Pandora do this? For those who haven’t tried it, you start by giving Pandora the name of one or more artists you like, thus creating a “station.” Pandora begins by playing a song or two by the artists you have selected. Then, it chooses a similar song or artist you didn’t select to see if you like it. You answer by clicking a “thumbs up” or “thumbs down” button. Information retrieval (IR) scientists call this “relevance feedback.”

Pandora analyzes the songs you like, as well as the songs you don’t, to make its suggestions. It looks at factors such as melody, harmony, rhythm, form, composition and lyrics to find similar songs. As you give it feedback on its suggestions, it takes that information into account in order to make better selections the next time. The IR people would call this “training.”

The process continues as you listen to your radio station. The more feedback you provide, the smarter the system gets. The end result is Pandora plays a lot of music you like and, occasionally, something you don’t like.

Predictive Ranking works in a similar way–only you work with documents rather than songs. As you train the system, it gets smarter about which documents are relevant to your inquiry and which are not.[1] It is as simple as that.

OK, but how does Predictive Ranking really work?

Well, it really is just like Pandora, although there are a few more options and strategies to consider. Also, different vendors approach the process in different ways, which can cause some confusion. But here is a start toward explaining the process.

1. Collect the documents you want to review and feed them to the computer.

To start, the computer has to analyze the documents you want to review (or not review), just like Pandora needs to analyze all the music it maintains. While approaches vary, most systems analyze the words in your documents in terms of frequency in the document and across the population.

Some systems require that you collect all of the documents before you begin training. Others, like our system, allow you to add documents during the training process. Either approach works. It is just a matter of convenience.

2. Start training/review.

You have two choices here. You can start by presenting documents you know are relevant (or non-relevant) to the computer or you can let the computer select documents for your consideration. With Pandora, you typically start by identifying an artist you like. This gives the computer a head start on your preferences. In theory, you could let Pandora select music randomly to see if you liked it but this would be pretty inefficient.

Either way, you essentially begin by giving the computer examples of which documents you like (relevant) and which you don’t like (non-relevant).[2] The system learns from the examples which terms tend to occur in relevant documents and which in non-relevant ones. It then develops a mathematical formula to help it predict the relevance of other documents in the population.

There is an ongoing debate about whether training examples must be provided by subject matter experts (SMEs) to be effective. Our research suggests that review teams assisted by SMEs are just as effective as SMEs alone. See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? Others disagree. See, for example, Ralph Losey’s posts about the need for SMEs to make the process effective.


3. Rank the documents by relevance.

This is the heart of the process. Based on the training you have provided, the system creates a formula which it uses to rank (order) your documents by estimated relevance.

4. Continue training/review (rinse and repeat).

Continue training using your SME or review team. Many systems will suggest additional documents for training, which will help the algorithm get better at understanding your document population. For the most part, the more training/review you do, the better the system will be at ranking the unseen documents.

5. Test the ranking.

How good a job did the system do on the ranking? If the ranking is “good enough,” move forward and finish your review. If it is not, continue your training.

Some systems view training as a process separate from review. Following this approach, your SMEs would handle the training until they were satisfied that the algorithm was fully trained. They would then let the review teams look at the higher-ranked documents, possibly discarding those below a certain threshold as non-relevant.

Our research suggests that a continuous learning process is more effective. We therefore recommend that you feed reviewer judgments back to the system for a process of continuous learning. As a result, the algorithm continues to get smarter, which can mean even fewer documents need to be reviewed. See: TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?

6. Finish the review.

The end goal is to finish the review as efficiently and cost-effectively as possible. In a linear review, you typically review all of the documents in the population. In a predictive review, you can stop well before then because the important documents have been moved to the front of the queue. You save on both review costs and the time it takes to complete the review.

Ultimately, “finishing” means reviewing down the ranking until you have found enough relevant documents, with the concept of proportionality taking center stage. Thus, you stop after reviewing the first 20% of the ranking because you have found 80% of the relevant documents. Your argument is that the cost to review the remaining 80% of the document population just to find the remaining 20% of the relevant documents is unduly burdensome.[3]

That’s all there is to it. While there are innumerable choices in applying the process to a real case, the rest is just strategy and execution.

How do I know if the process is successful?

That, of course, is the million-dollar question. Fortunately, the answer is relatively easy.

The process succeeds to the extent that the document ranking places more relevant documents at the front of the pack than you might get when the documents are ordered by other means (e.g. by date or Bates number). How successful you are depends on the degree to which the Predictive Ranking is better than what you might get using your traditional approach.

Let me offer an example. Imagine your documents are represented by a series of cells, as in the below diagram. The orange cells represent relevant documents and the white cells non-relevant.

Random Docs

What we have is essentially a random distribution, or at least there is no discernable pattern between relevant and non-relevant. In that regard, this might be similar to a review case where you ordered documents by Bates number or date. In most cases, there is no reason to expect that relevant documents would appear at the front of the order.

This is typical of a linear review. If you review 10% of the documents, you likely will find 10% of the relevant documents. If you review 50%, you will likely find 50% of the relevant documents.

Take a look at this next diagram. It represents the outcome of a perfect ordering. The relevant documents come first followed by non-relevant documents.

Perfect Docs

If you could be confident that the ranking worked perfectly, as in this example, it is easy to see the benefit of ordering by rank. Rather than review all of the documents to find relevant ones, you could simply review the first 20% and be done. You could confidently ignore the remaining 80% (perhaps after sampling them) or, at least, direct them to a lower-priced review team.

Yes, but what is the ranking really like?

Since this is directed at smart people, I am sure you realize that computer rankings are never that good. At the same time, they are rarely (if ever) as bad as you might see in a linear review.

Following our earlier examples, here is how the actual ranking might look using Predictive Ranking:

Actual Docs

We see that the algorithm certainly improved on the random distribution, although it is far from perfect. We have 30% of the relevant documents at the top of the order, followed by an increasing mix of non-relevant documents. At about a third of the way into the review, you would start to run out of relevant documents.

This would be a success by almost any measure. If you stopped your review at the midway point, you would have seen all but one relevant document. By cutting out half the document population, you would save substantially on review costs.

How do I measure success?

If the goal of Predictive Ranking is to arrange a set of documents in order of likely relevance to a particular issue, the measure of success is the extent to which you meet that goal. Put as a question: “Am I getting more relevant documents at the start of my review than I might with my typical approach (often a linear review)?”[4] If the answer is yes, then how much better?

To answer these questions, we need to take two additional steps. First, for comparison purposes, we will want to measure the “richness” of the overall document population. Second, we need to determine how effective our ranking system turned out to be against the entire document population.

1. Estimating richness: Richness is a measure of how many relevant documents are in your total document population. Some people call this “prevalence,” as a reference to how prevalent relevant documents are in the total population. For example, we might estimate that 15% of the documents are relevant, with 85% non-relevant. Or we might say document prevalence is 15%.

How do we estimate richness? Once the documents are assembled, we can use random sampling for this purpose. In general, a random sample allows us to look at a small subset of the document population, and make predictions about the nature of the larger set.[5] Thus, from the example above, if our sample found 15 documents out of a hundred to be relevant, we would project a richness of 15%. Extrapolating that to the larger population (100,000 for example), we might estimate that there were about 15,000 relevant documents to be found.

For those really smart people who understand statistics, I am skipping a discussion about confidence intervals and margins of error. Let me just say that the larger the sample size, the more confident you can be in your estimate. But, surprisingly, the sample size does not have to be that large to provide a high degree of confidence.
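For the curious, here is roughly how such an estimate is computed, using the standard normal approximation for a proportion. The sample numbers are hypothetical:

```python
import math

def richness_estimate(relevant_in_sample, sample_size, z=1.96):
    """Point estimate and margin of error (95% confidence when z = 1.96)."""
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

p, m = richness_estimate(60, 400)   # 60 relevant docs in a 400-doc sample
print(f"richness {p:.1%} +/- {m:.1%}")   # richness 15.0% +/- 3.5%
```

Note how modest the sample is: 400 documents pin the estimate to within a few percentage points regardless of whether the collection holds 100,000 documents or 10 million.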

2. Evaluating the ranking: Once the documents are ranked, we can then sample the ranking to determine how well our algorithm did in pushing relevant documents to the top of the stack. We do this through a systematic random sample.

In a systematic random sample, we sample the documents in their ranked order, tagging them as relevant or non-relevant as we go. Specifically, we sample every Nth document from the top to the bottom of the ranking (e.g. every 100th document). Using this method helps ensure that we are looking at documents across the ranking spectrum, from highest to lowest.

As an aside, you can actually use a systematic random sample to determine overall richness/prevalence and to evaluate the ranking. Unless you need an initial richness estimate, say for review planning purposes, we recommend you do both steps at the same time.
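A systematic sample is easy to sketch. Assuming the documents are already in ranked order, we step through the list at a fixed interval from a random starting offset, which guarantees coverage from the top of the ranking to the bottom:

```python
import random

def systematic_sample(ranked_ids, sample_size):
    """Every Nth document across the full ranking, from a random offset."""
    step = max(1, len(ranked_ids) // sample_size)
    start = random.randrange(step)
    return ranked_ids[start::step]

ranking = list(range(100_000))           # stand-in for ranked document IDs
sample = systematic_sample(ranking, 1_000)
print(len(sample), sample[0], sample[-1])  # 1,000 docs spread top to bottom
```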

You can read more about simple and systematic random sampling in an earlier article I wrote, Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?

Comparing the results

We can compare the results of the systematic random sample to the richness of our population by plotting what scientists call a “yield curve.” While this may sound daunting, it is really rather simple. It is the one diagram you should know about if you are going to use Predictive Ranking.

Linear Yield Curve

A yield curve can be used to show the progress of a review and the results it yields, at least in the number of relevant documents found. The X axis shows the percentage of documents reviewed (or to be reviewed). The Y axis shows the percentage of relevant documents found (or that you would expect to find) at any given point in the review.

Linear review: Knowing that the document population is 15% rich (give or take) provides a useful baseline against which we can measure the success of our Predictive Ranking effort. We plot richness as a diagonal line going from zero to 100%. It reflects the fact that, in a linear review, we expect the percentage of relevant documents to correlate to the percentage of total documents reviewed.

Following that notion, we can estimate that if the team were to review 10% of the document population, they would likely see 10% of the relevant documents. If they were to look at 50% of the documents, we would expect them to find 50% of the relevant documents, give or take. If they wanted to find 80% of the relevant documents, they would have to look at 80% of the entire population.

Predictive Review: Now let’s plot the results of our systematic random sample. The purpose is to show how the review might progress if we reviewed documents in a ranked order, from likely relevant to likely non-relevant. We can easily compare it to a linear review to measure the success of the Predictive Ranking process.

Predictive Yield Curve

You can quickly see that the line for the Predictive Review goes up more steeply than the one for linear review. This reflects the fact that in a Predictive Review the team starts with the most likely relevant documents. The line continues to rise until you hit the 80% relevant mark, which happens after a review of about 10-12% of the entire document population. The slope then flattens, particularly as you cross the 90% relevant line. That reflects the fact that you won’t find as many relevant documents from that point onward. Put another way, you will have to look through a lot more documents before you find your next relevant one.
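The curve itself is just a cumulative tally over the sample's tags. A minimal sketch, using a hypothetical set of reviewer tags in ranked order:

```python
def yield_curve(tags):
    """Percent of relevant docs found vs. percent of docs reviewed."""
    total_relevant = sum(tags)
    points, found = [], 0
    for i, tag in enumerate(tags, start=1):
        found += tag
        points.append((100.0 * i / len(tags),            # % reviewed
                       100.0 * found / total_relevant))  # % relevant found
    return points

# Toy ranking: most relevant docs near the top, as in a good Predictive Review.
tags = [True] * 12 + [False] * 70 + [True] * 3 + [False] * 15
curve = yield_curve(tags)
# After reviewing the top 12% of documents, 80% of the relevant ones are found.
print(curve[11])   # (12.0, 80.0)
```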

We now have what we need to measure the success of our Predictive Ranking project. To recap, we needed:

  1. A richness estimate so we have an idea of how many relevant documents are in the population.
  2. A systematic random sample so we can estimate how many relevant documents got pushed to the front of the ordering.
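For step 1, the richness estimate follows the usual normal approximation for a sample proportion. The sample numbers here are hypothetical, chosen to match the 15% richness used in the example above:

```python
import math

sample_size = 1_000
relevant_in_sample = 150
p = relevant_in_sample / sample_size                 # point estimate of richness
moe = 1.96 * math.sqrt(p * (1 - p) / sample_size)    # 95% margin of error
print(f"richness ≈ {p:.1%} ± {moe:.1%}")
```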

It is now relatively easy to quantify success. As the yield curve illustrates, if I engage in a Predictive Review, I will find about 80% of the relevant documents after reviewing only about 12% of the total documents. If I wanted to find 90% of the relevant documents, I could stop after reviewing just over 20% of the population. My measure of success would be the savings achieved over a linear review.[6]

At this point we move into proportionality arguments. What is the right stopping point for our case? The answer depends on the needs of your case, the nature of the documents and any stipulated protocols among the parties. At the least, the yield curve helps you frame the argument in a meaningful way.

Moving to the advanced class

My next post will take this discussion to a higher level, talking about some of the advanced questions that dog our industry. For a sneak peek at my thinking, take a look at a few of the articles we have already posted on the results of our research. I think you now have a foundation upon which to understand these and just about any other article on the topic you might find.

I hope this was helpful. Post your questions below. I will try and answer them (or pass them on to our advisory board for their thoughts).

Further reading:

[1] IR specialists call these documents “relevant” but they do not mean relevant in a legal sense. They mean important to your inquiry even though you may not plan on introducing them at trial. You could substitute hot, responsive, privileged or some other criterion depending on the nature of your review.

[2] I could use “irrelevant” but that has a different shade of meaning for the IR people so I bow to their use of non-relevant here. Either word works for this discussion.

[3] Sometimes at the meet-and-confer, the parties agree on Predictive Ranking protocols, including the relevance score that will serve as the cut-off for review.

[4] I will use a linear review (essentially a random relevance ordering) as a baseline because that is the way most reviews are done. If you review based on conceptual clusters or some other method, your baseline for comparison would be different.

[5] Note that an estimate based on a random sample is not valid unless you are sampling against the entire population. If you get new documents, you have to redo your sample.

[6] In a separate post we will argue that the true measure of success with Predictive Ranking is the total amount saved on the review, taking into consideration software and hardware along with human costs. Time savings is also an important factor. IR scientist William Webber has touched on this point here: Total annotation cost should guide automated review.

The Five Myths of Technology Assisted Review, Revisited

On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process and generated a lot of interest and a bit of dyspepsia too. At the least, it got some good discussions going and perhaps nudged the status quo a bit in the balance.

One writer, Roe Frazer, took issue with our views in a blog post he wrote. Apparently, he tried to post his comments with Law Technology News but was unsuccessful. Instead, he posted his reaction on the blog of his company, Cicayda. We would have responded there but we don’t see a spot for replies on that blog either.

We love comments like these and the discussion that follows. This post offers our thoughts on the points raised by Mr. Frazer and we welcome replies right here for anyone interested in adding to the debate. TAR 1.0 is a challenging-enough topic to understand. When you start pushing the limits into TAR 2.0, it gets really interesting. In any event, you can’t move the industry forward without spirited debate. The more the merrier.

We will do our best to summarize Mr. Frazer’s comments and offer our responses.

1. Only One Bite at the Apple?

Mr. Frazer suggests we were “just a bit off target” on the nature of our criticism. He rightly points out that litigation is an iterative (“circular” he calls it) business.

When new information comes into a case through initial discovery, with TAR/PC you must go back and re-train the system. If a new claim or new party gets added, then a document previously coded one way may have a completely different meaning and level of importance in light of the way the data facts changed. This is even more so the case with testimony, new rounds of productions, non-party documents, heck even social media, or public databases. If this happens multiple times, you wind up reviewing a ton of documents to have any confidence in the system. Results are suspect at best. Cost savings are gone. Time is wasted. Attorneys, entrusted with actually litigating the case, do not and should not trust it, and thus smartly review even more documents on their own at high pay rates. I fail to see the value of “continuous learning”, or why this is better. It cannot be.

He might be missing our point here. Certainly he is correct when he says that more training is always needed when new issues arise, or when new documents are added to the collection. And there are different ways of doing that additional training, some of which are smarter than others. But that is the purview of Myth #4, so we’ll address it below. Let us, therefore, clarify that when we’re talking about “only one bite of the apple,” we’re talking about what happens when the collection is static and no new issues are added.

To give a little background, let us explain what we understand to be the current, gold standard TAR workflow, to which we are reacting. What we see the industry in general saying is that the way TAR works is that you get ahold of the most senior, experienced, expertise-laden individual that you can, and then you sit that person down in front of an active learning TAR training (learning) algorithm and have the person iteratively judge thousands of documents until the system “stabilizes.” Then you apply the results of that learning to your entire collection and batch out the top documents to your contract review team for final proofing. At the point you do that batching, says the industry, learning is complete, finito, over, done. Even if you trust your contract review team to judge batched-out documents, none of those judgments are ever fed back into the system, to be used for further training to improve the ranking from the algorithm.

Myth #1 says that it doesn’t have to be that way. What “continuous learning” means is that all judgments during the review should get fed back into the core algorithm to improve the quality with regard to any and all documents that have not yet received human attention. And the reason why it is better? Empirically, we’ve seen it to be better. We’ve done experiments in which we’ve trained an algorithm to “stability,” and then we’ve continued training even during the batched-out review phase – and seen that the total number of documents that need to be examined until a defensible threshold is hit continues to go down. Is there value in being able to save even more on review costs? We think that there is.
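In outline, the continuous-learning loop looks like the sketch below. Here `rank`, `review_batch`, and `threshold_met` are hypothetical stand-ins for a real TAR engine's ranking, review, and stopping logic; the point is only that every reviewer judgment flows back into the next ranking:

```python
def continuous_review(collection, rank, review_batch, threshold_met, batch_size=500):
    """Keep re-ranking with every judgment fed back in, until a defensible
    stopping threshold is met."""
    judged = {}                                 # doc -> reviewer's call (True/False)
    while not threshold_met(judged):
        ranking = rank(collection, judged)      # re-rank using ALL judgments so far
        batch = [d for d in ranking if d not in judged][:batch_size]
        if not batch:                           # nothing left to review
            break
        judged.update(review_batch(batch))      # calls feed straight back into training
    return judged
```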

You can see some of the results of our testing on the benefits of continuous learning here.

2. Are Subject Matter Experts Required?

We understand that this is a controversial issue and that it will take time before people become comfortable with this new approach. To quote Mr. Frazer:

To the contrary, using a subject matter expert is critical to the success of litigation – that is a big reason AmLaw 200 firms get hired. Critical thinking and strategy by a human lawyer is essential to a well-designed discovery plan. The expertise leads to better training decisions and better defensibility at the outset. I thus find your discussion of human fallibility and review teams puzzling.

Document review is mind numbing and people are inconsistent in tagging which is one of the reasons for having the expert in the first place. With a subject matter expert, you are limiting the amount of fallible humans in the process. We have seen many “review lawyers” and we have yet to find one who does not need direction by a subject matter expert. One of the marketing justifications for using TAR/PC is that human review teams are average to poor at finding relevant documents – it must be worse without a subject matter expert. I do agree with your statement that “most senior attorneys… feel they have better things to do than TAR training in any event.” With this truth, you have recognized the problem with the whole system: Spend $100k+ on a review process, eat up a large portion of the client’s litigation budget, yet the expert litigation team who they hired has not looked at a single document, while review attorneys have been “training” the system? Not relying on an expert seems to contradict your point  3, ergo.

Again, the nature of this response indicates that you are approaching this from the standard TAR workflow, which is to have your most senior expert sit for a number of days and train to stability, and then never have the machine learn anything again. To dispel the notion that this workflow is the only way in which TAR can or even should work is one reason we’re introducing these myths in the first place. What we are saying in our Myth #2 is not that you would never have senior attorneys or subject matter experts involved in any way. Of course that person should train the contract reviewers.  Rather, we are saying that you can involve non-experts, non-senior attorneys in the actual training of the system and achieve results that are just as good as having *only* a senior attorney sit and train the system.  And our method dramatically lowers both your total time cost and your total monetary cost in the process.

For example, imagine a workflow in which your contract reviewers, rather than your senior attorney, do all the initial training on those thousands of documents. Then, at some point later in the process, your senior attorney steps in and re-judges a small fraction of the existing training documents. He or she corrects via the assistance of a smart algorithm only the most egregious, inconsistent training outliers and then resubmits for a final ranking. We’ve tested this workflow empirically, and found that it yields results that are just as good, if not better, than the senior attorney working alone, training every single document. (See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?)

Moreover, you can get through training more quickly, because now you have a team working in parallel rather than an individual working serially. Add to that the fact that your senior attorney does not always have free time the moment training needs to get done, and the flexibility to bring that attorney in at a later point, doing a tenth of the work he or she would otherwise have to do, becomes a recipe for success. That’s what this myth is about: the notion, held by the rest of the industry and reflected in your own response, that unless a senior attorney performs every action that in any way affects the training of the system, it is a recipe for disaster. It is not; that is a myth.

And again, we justify this not through appeals to authority (“that is a big reason AmLaw 200 firms get hired”), but through empirical methods. We’ve tested it out extensively. But if appeals to authority are what is needed to show that the algorithms we employ are capable of successfully supporting these alternative workflows, we can do so. Our in-house senior research scientist, Jeremy Pickens, has his PhD from one of the top two information retrieval research labs in the country, and not only holds numerous patents on the topic, but has received the best paper award at the top information retrieval conference in the world (ACM SIGIR). Blah blah blah. But we’d prefer not to have to resort to appeals to authority, because empirical validation is so much more effective.

Please note also that we in no way *force* you to use non-senior attorneys during the training process. You are of course free to work however you want to work. However, should time or money be an issue, we’ve designed our system to let you work successfully and more efficiently without requiring senior attorneys or experts to do all of your training.

You can see the results of our research on the use of subject matter experts here and here.

3. Must I Train on Randomly Selected Documents?

We pointed out in our article that it is a myth that TAR training can only be on random documents.

You totally glossed over bias. Every true scientific study says that judgmental sampling is fraught with bias. Research into sampling in other disciplines is that results from judgmental sampling should be accepted with extreme caution at best. It is probably even worse in litigation where “winning” is the driving force and speed is omnipresent. Yet, btw, those who advocate judgmental sampling in eDiscovery, Grossman, e.g., also advocate that the subject matter experts select the documents – this contradicts your points in 2. You make a true point about the richness of the population making it difficult to find documents, but this militates against random selection, not for it. To us this shows another reason why TAR/PC is broken. Indeed “clicking through thousands of random documents is boring” – but this begs the question. It was never fun reviewing a warehouse of banker’s documents either. But it is real darn fun when you find the one hidden document that ties everything together, and wins your case. What is boring or not fun has nothing to do with the quality of results in a civil case or criminal investigation.

I hope we have managed to clarify that Myth #2 is not actually saying that you never have to involve a senior attorney in any way, shape or form. Rather we believe that a senior attorney doesn’t have to do every single piece of the TAR training, in all forms, at all times. Once you understand this, you quickly realize that there is no contradiction between what Maura Grossman is saying and what we are saying.

If you want to do judgmental sampling, let your senior attorney and all of his or her wisdom be employed in creating the search queries used to find interesting documents. But instead of requiring that senior person to then look at every single result of those queries, let your contract reviewers comb through them. In that manner, you involve your senior attorney where his or her skills are most valuable and his or her time is most precious. It takes a lot less time to issue a few queries than it does to sit and judge thousands of documents. Are we the only vendor aware that the person who issues the searches and the person who judges the resulting documents don’t have to be the same person? We would hope not, but perhaps we are.

Now, to the issue of bias. You’re quite right to be concerned about this, and we fault the necessary brevity of our original article in not being able to go into enough detail to satisfy your valid concerns. So we would recommend reading the following article, as it goes into much more depth about how bias is overcome when you start judgmentally, and it backs up its explanations empirically: Predictive Ranking: Technology-Assisted Review Designed for the Real World.

Imagine your TAR algorithm as a seesaw. That seesaw has to be balanced, right? So you have many in the industry saying that the only way to balance it is to randomly select documents along the length of that seesaw. In that manner, you’ll approximately have the same number of docs, at the same distance from the center, on both sides of the seesaw. And the seesaw will therefore be balanced. Judgmental sampling, on the other hand, is like plopping someone down on the far end of the seesaw. That entire side sinks down, and raises the other side high into the air, throwing off the balance. Well, in that case, the best way to balance the seesaw again is to explicitly plop down another equal weight on the exact opposite end of the seesaw, bringing the entire system to equilibrium.

What we’ve designed in the Catalyst system is an algorithm that we call “contextual diversity.” “Contextual” refers to where things have already been plopped down on that seesaw. The “diversity” means “that area of the collection that is most about the things that you know that you know nothing about,” i.e. that exact opposite end of the seesaw, rather than some arbitrary, random point. Catalyst’s contextual diversity algorithm explicitly finds and models those balancing points, and surfaces those to your human judge(s) for coding. In this manner, you can both start judgmentally *and* overcome bias. We apologize that this was not as clear in the original 5 Myths article, but we hope that this explanation helps.
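Catalyst's actual contextual diversity algorithm is not public, but the core idea can be sketched with a simple max-min selection: pick the unlabeled document least similar to anything already judged, i.e. the opposite end of the seesaw. Word-overlap similarity here is a stand-in for a real document model:

```python
def similarity(a, b):
    """Jaccard overlap between two documents' word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def contextual_diversity_pick(unlabeled, labeled):
    """Return the unlabeled doc least covered by the labeled set."""
    return min(unlabeled,
               key=lambda d: max(similarity(d, l) for l in labeled))

labeled = ["widget sales contract", "widget pricing memo"]
unlabeled = ["widget sales forecast", "quarterly marketing budget plan"]
print(contextual_diversity_pick(unlabeled, labeled))
# -> "quarterly marketing budget plan" (the part of the collection training hasn't touched)
```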

We go into this subject in more detail here.

4. You Can’t Start Training until You Have All of Your Documents

One of the toughest issues in TAR systems is the requirement that you collect all of your documents before you start TAR training. This limitation stems from the use of a randomly selected control set to both guide training and provide defensibility. If you add new documents to the mix (rolling uploads), they will not be represented in the control set. Thus even if you continue training with some of these new documents, your control set would be invalid and you lose defensibility.

You might have missed that point in your comments:

I think this is similar to #1 in that you are not recognizing the true criticism that things change too much in litigation. While you can start training whenever you want and there are algorithms that will allow you to conduct new rounds on top of old rounds – the real problem is that you must go back and change previous coding decisions because the nature of the case has changed. To me, this is more akin to “continuous nonproductivity” than “continuous learning.”

The way in which we natively handle rolling uploads from a defensibility standpoint is to not rely on a human-judged control set. There are other intelligent metrics we use to monitor the progress of training, so we do not abandon the need for reference, or our defensibility, altogether – just the need for expensive, human-judged reference.

The way other systems have to work, in order to keep their control set valid, is to judge another statistically valid sample of documents from the newly arrived set. And in our experience, in the cases we’ve dealt with over the past five years, there have been on average around 67 separate uploads until the collection was complete. Let’s be conservative and assume you’re dealing with a third of that – say only 20 separate uploads from start to finish. As each new upload arrives, you’re judging 500 randomly selected documents just to create a control set. 500 * 20 = 10,000. And let’s suppose your senior attorney gets through 50 documents an hour. That’s 200 hours of work just to create a defensible control set, with not even a single training document yet judged.  And since you’ve already stated that you need to hire an AmLaw 200 senior attorney to judge these documents, at $400/hour that would be $80,000. Our approach saves you that money right off the bat by being able to natively handle the control set/rolling upload issue. Plug in your own numbers if you don’t like these, but our guess is that it’ll still add up to a significant savings.
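The back-of-the-envelope math in that paragraph, with the same numbers plugged in so you can swap in your own:

```python
uploads = 20              # separate rolling uploads (conservative estimate)
control_per_upload = 500  # randomly selected docs judged per upload
docs_per_hour = 50        # senior attorney's review speed
rate = 400                # senior attorney's hourly rate, in dollars

control_docs = uploads * control_per_upload
hours = control_docs / docs_per_hour
cost = hours * rate
print(control_docs, hours, cost)   # 10000 200.0 80000.0
```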

But the control set is only half of the story. The other half is the training itself. Let us distinguish if we may between an issue that changes, and a collection that changes. If it is your issue itself (i.e. your definition of responsiveness) that changes when new documents are collected, then certainly nothing we’ve explicitly said in these Five Myths will address that problem. However, if all you are talking about is the changing expression of an unchanging issue, then we are good to go.

What do we mean by the changing expression of an unchanging issue? We mean that if you’ve collected from your engineering custodians first, and started to train the system on those documents, and then suddenly a bunch of marketing custodians arrive, that doesn’t actually change the issue that you’re looking for. What responsiveness was about before is still what responsiveness is about now. However, how that issue is expressed will change. The language that the marketers use is very different than the language that the engineers use, even if they’re talking about the same responsive “aboutness.”

This is exactly why training is a problem for the standard TAR 1.0 workflow. If you’re working in a way that requires your expert to judge all the documents up front, then if the collection grows (by adding the marketing documents to the engineering collection), that expert’s work is not really applicable to the new information and you have to go back to the drawing board, selecting another set of random documents so as to avoid bias, feed those yet again to a busy, time-pressed expert, etc. That is extremely inefficient.

What we do with our continuous learning is once again employ that “contextual diversity” algorithm that we mentioned above. Let us return to the seesaw analogy. Imagine that you’ve got your seesaw, and through the training that you’ve done it is now completely balanced. Now, a new subset of (marketing) documents appears; that is like adding a third plank to the original seesaw. Clearly what happens is that now things are unbalanced again. The two existing planks sink down to the ground, and that third plank shoots up into the air. So how do we solve for that imbalance, without wasting the effort that has gone into understanding the first two planks? Again, we use our contextual diversity algorithm to find the most effective balance point, in the most efficient, direct (aka non-random) manner possible.

Contextual diversity cares neither why nor how the training over a collection of documents is imbalanced. It simply detects the most effective points that, once pressure is applied to those points, rebalance the system. It does not matter if the seesaw started with two planks and then suddenly grew a third via rolling uploads, or if the seesaw started with three planks, and someone’s judgmental sampling only hit two of those planks. In both cases, there is imbalance, and in both cases, the algorithm explicitly models and corrects for that imbalance.

You can read more about this topic here.

5. TAR Does Not Work for non-English Documents

Many people have now realized that, properly done, TAR can work for other languages including the challenging CJK (Chinese, Japanese and Korean) languages. As we explained in the article, TAR is a “mathematical process that ranks documents based on word frequency. It has no idea what the words mean.”

Mr. Frazer seems to agree but is pitching a different kind of system for TAR:

Words are the weapons of lawyers so why in the world would you use a technology that does not know what they mean? TAR & PC are, IMHO, roads of diversion (perhaps destruction in an outlier case) for the true litigator. They are born out of the need to reduce data, rather than know what is in a large data set. They ignore a far better system is one that empowers the subject matter experts, the true litigators, and even the review team to use their intelligence, experience, and unique skills to find what they need, quickly and efficiently, regardless of how big the data is. A system is needed to understand the words in documents, and read them as a human being would.

There are a lot of points we could make in response to this, but this post is lengthy enough as it is. So let us briefly make just two. The first is that we think Natural Language Processing (which apparently your company uses) techniques are great. There is a lot of value there. And frankly, we think that NLP techniques complement, rather than oppose, the more purely statistical techniques.

That said, our second point is simply to note that in some of the cases that we’ve dealt with here at Catalyst, we have datasets in which over 85% of the documents are computer source code. Where there is no natural language, there can be no NLP. And yet TAR still has to be able to handle those documents as well. So perhaps we should extend Myth #5 to say that it’s a myth that “TAR Does Not Work for Non-Human Language Documents.”


In writing the Five Myths of TAR, our point wasn’t to claim that Catalyst has the only way to address the practical limitations of early TAR systems. To the contrary, there are many approaches to technology-assisted review which a prospective user will want to consider, and some are more cost and time effective than others. Rather, our goal was to dispel certain myths that limit the utility of TAR and let people know that there are practical answers to early TAR limitations. Debating which of those answers works best should be the subject of many of these discussions. We enjoy the debate and try to learn from others as we go along.

How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question

For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes.  So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases. The sites themselves were not chosen for any particular reason, although we looked for a minimum of 10,000 native files on each. We also chose not to use several larger sites where clients used text files as substitutes for the original natives.

Our focus for the study was on standard office files, such as Word, Excel, PowerPoint, PDFs and email. These are generally the focus of most review and discovery efforts and seem most important to our inquiry. I will discuss several other file types a bit later in this report.

I should also note that the files used in our study had already been processed and loaded into Catalyst Insight, our discovery repository. Thus, they had been de-NISTed, de-duped (or not, depending on client requests), culled, reduced, etc. My point here was not to exclude any particular part of the document population; rather, those kinds of files don’t often make it past processing and are typically not included in a review.

That said, here is a summary of what we found when we focused on the office and standard email files.


The weighted average for these files comes out to 3,124 docs per gigabyte. Not surprisingly, there are wide variations in the counts for different types of files. You can see these more easily when I chart the data.
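A weighted average of this kind is simply total documents divided by total gigabytes, which is equivalent to weighting each file type's docs-per-gigabyte figure by its share of the volume. The per-type numbers below are hypothetical, not the study's actual breakdown:

```python
file_types = {
    # type: (doc_count, total_gigabytes) -- illustrative figures only
    "email": (4_000_000, 1_100),
    "word":  (1_500_000,   400),
    "excel": (  900_000,   600),
    "pdf":   (1_200_000,   500),
}
total_docs = sum(d for d, _ in file_types.values())
total_gb = sum(g for _, g in file_types.values())
print(round(total_docs / total_gb))   # overall weighted docs per gigabyte
```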


The average in 2014 was about 25% higher than our average in 2011 (2,500 docs per gigabyte). Does that suggest a decrease in the size of the files we create today? I doubt it. People seem to be using more and more graphical elements in their PowerPoints and Word files, which would suggest larger file sizes and lower docs per gigabyte. My guess is that we are seeing routine sampling variation here rather than some kind of trend.

EML and Text Files

We had several sites with EML files (about 2 million in total). These were extracted from Lotus Notes databases by one of our processing partners (our process would normally output to HTML rather than EML). An EML file is essentially a text file with some HTML formatting. Including the EML files will increase the averages for files per gigabyte.

We also had sites with a large number of text and HTML files. Some were chat logs, others were purchase orders and still others were product information. If your site has a lot of these kinds of files, you will see higher averages in your overall counts.

Here are the numbers we retrieved for these kinds of files.


Because of the large number of EML files, the weighted average here is much higher, at just over 15,500 files per gigabyte.

Image Files

Many sites had a large number of image files. In some cases they were small GIF files associated with logos or other graphics displayed in the email itself. It appears that these files were extracted from the email during processing and treated as separate records. In our processing, we don’t normally extract these types of files but rather leave them with the original email.

In any event, here are the numbers associated with these types of files.


We did not find many image files in our last study. I don’t know if these numbers reflect different collection practices, different case issues or just happened to fall in the 2014 matters.

In any event, I did not think it would be helpful to our inquiry to include image files (and especially GIF files) because they are not typically useful in a review. If you do, the number of docs per gigabyte will be affected.

What Did We Learn?

In many ways, the figures from this study confirmed my conclusions in 2011. Once again, it seems that the industry-accepted figure of 10,000 files per gigabyte is over the mark and even the lower range figure of 5,000 seems high. For the typical files being reviewed by our clients, our number is closer to 3,000.

That value changes depending on what files make up your review population. If your site has a large number of EML or text files, expect the averages to get higher. If, conversely, you have a lot of Excel files, the average can drop sharply.

In my discussion so far, I broke out the different file types in logical groupings. If we include all of the different file types in our weighted averages, the numbers come out like this:


Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower range of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, which is midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.
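The weighted-average calculation used throughout this article can be sketched as follows. Note that the per-type counts and sizes below are hypothetical illustrations, not the study’s actual figures:

```python
# Weighted average of documents per gigabyte across file types.
# The counts and gigabyte totals below are hypothetical stand-ins,
# not the actual data from this study.
file_stats = {
    # file type: (total documents, total gigabytes)
    "DOC/DOCX": (120_000, 40.0),
    "XLS/XLSX": (30_000, 25.0),
    "PDF": (80_000, 35.0),
    "EML": (400_000, 25.0),
}

def weighted_docs_per_gb(stats):
    """Total documents divided by total gigabytes.

    This weights each file type by its share of the overall volume,
    rather than averaging the per-type rates directly.
    """
    total_docs = sum(docs for docs, _ in stats.values())
    total_gb = sum(gb for _, gb in stats.values())
    return total_docs / total_gb

print(round(weighted_docs_per_gb(file_stats)))  # → 5040
```

The point of weighting is that a volume-heavy file type (like EML here) pulls the overall rate toward its own documents-per-gigabyte figure, which is why excluding it changes the estimate so sharply.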

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn’t 10,000 or anything close. We use a figure closer to 3,000 for our estimates.


[1] I wish to particularly thank Greg Berka, Catalyst’s director of application support, for helping to assemble the data used in this article. He also assisted in the 2011 study.

My Prediction for 2014: E-Discovery is Dead — Long Live Discovery!

There has been debate lately about the proper spelling of the shorthand version for electronic discovery. Is it E-Discovery or e-discovery or Ediscovery or eDiscovery? Our friends at DSIcovery recently posted on that topic and it got me thinking.


The industry seems to be of differing minds. Several of the leading legal and business publications use e-discovery, as do we. They include Law Technology News, the other ALM publications, the Wall Street Journal (see here, for example), the ABA Journal (example), Information Week (example) and Law360 (example).

Also using e-discovery are industry analysts such as Gartner and 451 Research.

A number of vendors favor the non-hyphenated versions eDiscovery or ediscovery. They include: Symantec, EPIQ, Kroll Ontrack, Recommind and HP Autonomy.

One other vendor, kCura, goes with e-discovery.

Which is It?

So, which is it? E-Discovery or eDiscovery (or some variant on the caps)? I say none of the above. It is time that we take the “E” out of e-discovery once and for all.

When I started practicing law thirty years ago, there was no “E” in discovery. Rather, it was about exchanging paper documents prior to trial. As documents went digital, the need to consider electronic discovery arose. This new category needed a name. E-discovery seemed a perfect fit.

Today, discovery of electronic files makes up almost the entirety of this thing we call discovery. To be sure, paper documents can still be found but they are the tail that no longer wags the proverbial dog. The big dog today is electronic discovery.

I predict that in 2014 we will start to put the “D” back in Discovery, realizing that we don’t need a special category for what is now a ubiquitous process. Discovery is what this is all about, and it is all digital. Perhaps people will start calling it D-Discovery but I hope not. Discovery sounds just fine to me.

Cast Your Vote

So, will 2014 be the year we take the “E” out of E-Discovery? That’s my bet. Dealing with electronic files is no longer a segment of the discovery process; it *is* the process. It is time we recognized that fact and dropped the hyphen.

This is discovery after all—no more, no less. There is no longer a distinction between producing paper and electronic files (and the paper ones are all digitized anyway). Why do we need a specialized species when it has already swallowed up the entire genus?

E-Discovery is dead. Long live Discovery.

Tell me if you agree.

Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?

One of the givens of traditional CAR (computer-assisted review)[1] in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.

Later in the process, we use simple random samples to determine whether our CAR succeeded. We sample the discards to ensure that we have not overlooked too many relevant documents.

But is that the best route for our CAR? Maybe not. Our road map leads us to believe that a process called systematic random sampling will get you to your destination faster, with fewer stops. In this post, I will tell you why.[2]

About Sampling

Even we simpleton lawyers (J.D.s rather than Ph.D.s) know something about sampling. Sampling is the process by which we examine a small part of a population in the hope that our findings will be representative of the larger population. It’s what we do for elections. It’s what we do for QC processes. It’s what we do with a box of chocolates (albeit with a different purpose).

Academics call it “probability sampling” because every element has some known probability of being sampled, which then allows us to make probabilistic statements about the likelihood that the sample is representative of the larger population.

There are several ways to do this including simple random, systematic and stratified sampling. For this article, my focus is on the first two methods: simple random and systematic.

Simple Random Sampling

The most basic form of sampling is “simple random sampling.” The key here is to employ a sampling process that ensures that each member of the sampled population has an equal chance of being selected.[3] With documents, we do this with a random number generator and a unique ID for each file. The random number generator is used to select IDs in a random order.

I am not going to go into confidence intervals, margins of error or other aspects of the predictive side of sampling. Suffice it to say that the size of your random sample helps determine your confidence about how well the sample results will match the larger population. That is a fun topic as well but my focus today is on types of sampling rather than the size of the sampling population needed to draw different conclusions.
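The selection step described above can be sketched with Python’s standard library. The document IDs here are stand-ins, and this is an illustration of the general technique rather than any particular vendor’s implementation:

```python
import random

def simple_random_sample(doc_ids, sample_size, seed=None):
    """Draw a simple random sample: every document has an
    equal chance of being selected, with no duplicates."""
    rng = random.Random(seed)  # seed only for reproducibility
    return rng.sample(doc_ids, sample_size)

# A population of 100,000 documents with unique IDs.
population = [f"DOC-{i:06d}" for i in range(1, 100_001)]

sample = simple_random_sample(population, 400, seed=42)
print(len(sample))  # → 400
```

In practice the unique ID is whatever control number the review platform assigns; the essential property is that the random number generator, not the reviewer, decides which IDs make the cut.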

Systematic Random Sampling

A “systematic random sample” differs from a simple random sample in two key respects. First, you need to order your population in some fashion. For people, it might be alphabetically or by size. For documents, we order them by their relevance ranking. Any order works as long as the process is consistent and it serves your purposes.

The second step is to draw your sample in a systematic fashion. You do so by choosing every Nth person (or document) in the ranking from top to bottom. Thus, you might select every 10th person in the group to compose your sample. As long as you don’t start with the first person on the list but instead select your first person in the order randomly (say from the top ten people), your sample is a valid form of random sampling and can be used to determine the characteristics of the larger population. You can read more about all of this at Wikipedia and many more sources. Don’t just take my word for it.
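The two steps above, ordering the population and then taking every Nth item from a random starting point, can be sketched like this (again with stand-in document IDs, as a sketch of the technique rather than a production implementation):

```python
import random

def systematic_random_sample(ranked_docs, sample_size, seed=None):
    """Systematic random sample over an ordered population:
    choose a random start within the first interval, then take
    every Nth item from there to the bottom of the ranking."""
    interval = len(ranked_docs) // sample_size
    start = random.Random(seed).randrange(interval)  # random start, not item #1
    return ranked_docs[start::interval][:sample_size]

# 10,000 documents ordered by relevance ranking, highest first.
ranked = [f"DOC-{i:05d}" for i in range(10_000)]

sample = systematic_random_sample(ranked, 100, seed=7)
print(len(sample))  # one document from each block of 100 ranks
```

The random starting point is what keeps this a valid form of random sampling: every document still has an equal chance of selection, but the picks are guaranteed to be spread evenly across the ranking.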

Why Would I Use a Systematic Random Sample?

This is where the rubber meets the road (to overuse the metaphor). For CAR processes, there are a lot of advantages to using a systematic random sample over a random sample. Those advantages include getting a better picture of the document population and increasing your chances of finding relevant documents.

Let me start by emphasizing an important point. When you’re drawing a sample, you want it to be “representative” of the population you’re sampling. For instance, you’d like each sub-population to be fairly and proportionally represented. This particularly matters if sub-populations differ in the quality you want to measure.

Drawing a simple random sample means that we’re not, by our selection method, deliberately or negligently under-representing some subset of the population. However, due to random variability, we may still oversample one subset of the population and undersample another. If the sub-populations do differ systematically, this may skew our results. We may miss important documents.

An Example: Sports Preferences for Airport Travelers

William Webber gave me a sports example to help make the point.

Say we are sampling travelers in a major international airport to see what sports they like (perhaps to help the airport decide what sports to televise in the terminal). Now, sports preference tends to differ among countries, and airline flights go between different countries (and at different times of day you’ll tend to find people from different areas traveling).

So it would not be a good idea to just sit at one gate and sample the first hundred people off the plane. Let’s say you’re in Singapore Airport. If you happen to pick a stop-over flight on the way from Australia to India, your sample will “discover” that almost all air travelers in the terminal are cricket fans. Or if there is a lawyers’ convention in Bali and you pick a flight from the United States, your study might convince the airport to show American football around the clock.

Let’s say instead that you are able to draw a purely random sample of travelers (perhaps through boarding passes–let’s not worry about the practicality of getting to these randomly sampled individuals). You’ll get a better spread, but you might tend to bunch up on some flights, and miss others–perhaps 50% more samples on the Australian-India flight, and 50% fewer on the U.S.-Bali one.

This might be particularly unfortunate if some individuals were more “important” than the others. To develop the scenario, let’s say the airport also wanted to offer sports betting for profit. Then maybe American football is an important niche market, and it would be unfortunate if your random sample happened to miss those well-heeled lawyers dying to bet on that football game I am watching as I write this post.

What you’d prefer to do (and again, let’s ignore practicalities) is to spread your sample out, so that you are assured of getting an even coverage of gates and times (and even seasons of the year). Of course, your selection will still have to be random within areas, and you still might get unlucky (perhaps the lawyer you catch hates football and is crazy about croquet). But you’re more likely to get a representative sample if your approach is systematic rather than simple random.

Driving our CAR Systematically

Let’s get back in our CAR and talk about the benefit of sampling against our document ranking. In this case, the value we’re trying to estimate is “relevance” (or more exactly, something about the distribution of relevance). Here, the population differentiation is a continuous one, from the highest relevance ranking to the lowest. This differentiation is going to be strongly correlated with the value we’re trying to measure.

Highly ranked documents are more likely to be relevant than lowly ranked ones (or so we hope). So if our simple random sample happened by chance to over-sample from the top of the ranking, we’re going to overstate the total number of relevant documents in the population.

Likewise, if our random sample happened by chance to oversample from the bottom of the ranking, our sample might understate the number of relevant documents in the population. By moving sequentially through the ranking from top to bottom, a systematic random sample removes the danger of this random “bunching,” and so makes our estimate more accurate overall.

At different points in the process, we might also want information about particular parts of the ranking. First, we may be trying to pick a cutoff. That suggests we need good information about the area around our candidate cutoff point.

Second, we might wonder if relevant documents have managed to bunch in some lower part of the ranking. It would be unfortunate if our simple random sample happened not to pick any documents from this region of interest. It would mean that we might miss relevant documents.

With a systematic random sample, we are guaranteed that each area of the ranking is equally represented. That is the point of the sample: to draw from each segment of the ranking (each decile, for example) and see what kinds of documents live there. Indeed, if we are already determined to review the top-ranking documents, we might want to place more emphasis on the lower rankings. Or not, depending on our goals and strategy.

Either way, the point of a systematic random sample is to ensure that we sample documents across the ranking–from top to bottom. We do so in the belief that it will provide a more representative look at our document population and give us a better basis for drawing a “yield curve.”[4] To be fair, however, the documents selected from a particular region might not be representative of that region. Whichever you choose, random or systematic, there is always the chance that you will miss important documents.
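One way to use such a sample, sketched here under the assumption that each sampled document has already been reviewed and coded, is to estimate richness within each segment (decile) of the ranking. The function name and the coding data below are hypothetical illustrations:

```python
def decile_richness(sampled_ranks, is_relevant, population_size):
    """Estimate the share of relevant documents in each decile
    of the ranking from a reviewed systematic sample.

    sampled_ranks: rank positions (0 = top) of sampled documents
    is_relevant:   dict mapping rank -> reviewer's coding (bool)
    """
    decile_size = population_size / 10
    hits = [0] * 10
    totals = [0] * 10
    for rank in sampled_ranks:
        d = min(int(rank / decile_size), 9)  # which decile this rank falls in
        totals[d] += 1
        hits[d] += int(is_relevant[rank])
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

# Hypothetical example: 10,000 ranked documents, sampled every 100th
# rank, where (for illustration) only the top 2,000 ranks are relevant.
sample = list(range(50, 10_000, 100))
coding = {r: r < 2_000 for r in sample}

richness = decile_richness(sample, coding, 10_000)
print(richness)  # → [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Because the systematic sample draws equally from every decile, each estimate rests on the same number of reviewed documents, which is what makes the resulting yield curve a fair picture of the whole ranking.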

Does it Work?

Experiments have shown us that documents can bunch together in a larger population. Back in the paper days, I knew that certain custodians were likely to have the “good stuff” in their correspondence files and I always went there first in my investigation. Likewise, people generally kept types of documents together in boxes, which made review quicker. I could quickly dismiss boxes of receipts when they didn’t matter to my case while spending my time on research notebooks when they did.

Similarly, and depending on how they were collected, relevant documents are likely to be found in bunches across a digital population. After all, I keep files on my computer in folders much like I did before I had a computer. It helps with retrieval. Other people do as well. The same is true for email, which I dutifully folder to keep my inbox clear.

So, no problem if those important documents get picked up during a random sample, or even because they are similar to other documents tagged as relevant. However, sometimes they aren’t picked up. They might still be bunched together but simply fall toward the bottom of the ranking. Then you miss out on valuable documents that might be important to your case.

While no method is perfect, we believe that a systematic random sample offers a better chance that these bunches get picked up during the sampling process. The simple reason is that we are intentionally working down the ranking to make sure we see documents from all segments of the population.

From experiments, we have seen this bunching across the ranking (yield) curve. By adding training documents from these bunches, we can quickly improve the ranking, which means we find more relevant documents with less effort. Doing so means we can review fewer documents at a lower cost. The team is through more quickly as well, which is important when deadlines are tight.

Many traditional systems don’t support systematic random sampling. If that is the case with your CAR, you might want to think about an upgrade. There is no doubt that simple random sampling will get you home eventually but you might want to ride in style. Take a systematic approach for better results and leave the driving to us.


[1] I could use TAR (Technology Assisted Review) but it wouldn’t work as well for my title. So, today it is CAR. Either term works for me.

[2] Thanks are due to William Webber, Ph.D., who helped me by explaining many of the points raised in this article. Webber is one of a small handful of CAR experts in the marketplace and, fortunately for us, a member of our Insight Predict Advisory Board. I am using several of his examples with permission.

[3] Information retrieval scientists put it this way: In simple random sampling, every combination of elements from the population has the same probability of being the sample. The distinction here is probably above the level of this article (and my understanding).

[4] Yield curves are used to represent the effectiveness of a document ranking and are discussed in several other blog posts I have written (see, e.g., here, here, here and here). They can be generated from a simple random sample but we believe a systematic random sample–where you move through all the rankings–will provide a better and more representative look at your population.

In the World of Big Data, Human Judgment Comes Second, The Algorithm Rules

I read a fascinating blog post from Andrew McAfee for the Harvard Business Review. Titled “Big Data’s Biggest Challenge? Convincing People NOT to Trust Their Judgment,” the article’s primary thesis is that as the amount of data goes up, the importance of human judgment should go down.

Downplay human judgment? In this age, one would think that judgment is more important than ever. How can we manage in this increasingly complex world if we don’t use our judgment?

Even though it may seem counterintuitive, support for this proposition is piling up rapidly. McAfee cites numerous examples to back his argument. For one, it has been shown that parole boards do much worse than algorithms in assessing which prisoners should be sent home. Pathologists are not as good as image analysis software at diagnosing breast cancer. And, apparently, a number of top American legal scholars were beaten at predicting Supreme Court votes by a data-driven decision rule.

Have you heard how they finally taught computers to translate? For many years, humans tried to create more and more complicated rules to govern grammar and translation between different languages. Microsoft and many others struggled with the problem, finding they could only get so far with this largely human-based approach. The resulting translations were sometimes passable but more often comical, even with humans articulating the rules of the road.

Franz Josef Och, a research scientist at Google, tried a different approach. Rather than try to define language through rules and grammar, he simply tossed a couple billion translations at the computer to see what would happen. The result was a huge leap forward in the accuracy of computerized translation and a model that most other companies (including Microsoft) follow today. You can read more here and more about these kinds of stories in the book, Big Data: A Revolution That Will Transform How We Live, Work and Think.

What’s this have to do with legal search?

It turns out a lot. McAfee reaches the surprising conclusion that humans need to play second fiddle to the algorithms when it comes to big data. Counterintuitive as it seems, the purpose of data analytics is not to assist humans in exercising their judgment. However much we carbon-based units would like to think otherwise, we simply don’t do as good a job in many situations, even when presented with the insights that algorithms can provide. We tend to dismiss them in favor of our emotions and biases. At best:

What you usually see is [that] the judgment of the aided experts is somewhere in between the model and the unaided expert. So the experts get better if you give them the model. But still the model by itself performs better.

(Citing sociologist Chris Snijders, quoted in the Ian Ayres book, Super Crunchers: Why Thinking-by-Numbers is the New Way to be Smart.)

We need to flip our bias on its head, McAfee argues. Rather than have the algorithm aid the expert, the better approach is to have the expert assist the algorithm. It turns out that the results get better when the expert lends his or her judgment to the computer algorithm rather than vice versa. As Ayres put it in Super Crunchers:

Instead of having the statistics as a servant to expert choice, the expert becomes a servant of the statistical machine.

Here is the fun part. It turns out that lawyers are at the forefront of this trend. “How so?” you ask. Easy, I respond. Technology-assisted review.

Although TAR vendors use different algorithms and even different approaches, the lawyer serves the algorithm and not vice versa. For traditional TAR, we look to subject matter experts to train the algorithm. They do so by reviewing documents and tagging them as relevant or not. The algorithm uses their judgments to help build its rankings. But the order is clear: the experts are serving the algorithm and not the other way around.

Even if we use review teams instead of experts for training, as I have considered in several recent articles (see here and here), the pecking order doesn’t change. The reviewers are working for the algorithm to support its efforts to analyze big data. The algorithm and not the reviewers is ultimately the decision maker for the ranking; we play only a supporting role.

Several studies have documented the superiority of TAR over human judgment. In a study published in 2011, Maura Grossman of Wachtell, Lipton, Rosen & Katz and Prof. Gordon Cormack of the University of Waterloo, concluded, “[T]he myth that exhaustive manual review is the most effective—and therefore the most defensible—approach to document review is strongly refuted. TAR can (and does) yield more accurate results than exhaustive manual review, with much lower effort.”

Earlier, in their 2009 study for the Electronic Discovery Institute, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Herbert L. Roitblat, Anne Kershaw and Patrick Oot compared human review to TAR. “On every measure,” they concluded, “the performance of the two computer systems was at least as accurate … as that of a human re-review.”

But while these studies only hinted at McAfee’s thesis, the further evolution of TAR technology and the further growth of big data have made it explicitly clear. It turns out that even the legal profession, with its reverence for tradition, is no longer immune from these evolutionary trends. Big data demands new methods and new masters, and legal is no exception. All we can do is listen and learn and move with the times.

The Future

Long ago a wit proclaimed that the law office of the future would have a lawyer, a dog and a computer. The lawyer would be there to turn on the computer in the morning. The dog was there to keep the lawyer away from the computer for the rest of the day.

I wonder if that fellow was thinking about Big Data and the evolution of our information society? If not, he came pretty close in his prediction. The dog just gave way to a smart algorithm. We call ours Fido.