Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices
John Tredennick

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.
Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on the use of computers in litigation. At the same time, he wrote How to Prepare for, Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer's Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer's Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in their special issue on the “Top Fifty Big Law Innovators” in the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London's CityTech magazine named him one of the "Top 100 Global Technology Leaders." In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA's Law Practice Management Section. For many years, he was editor-in-chief of the ABA's Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken about it to audiences on four of the five continents. In his spare time, you will find him competing on the national equestrian show jumping circuit.

Using TAR in International Litigation: Does Predictive Coding Work for Non-English Languages?

[This article originally appeared in the Winter 2014 issue of EDDE Journal, a publication of the E-Discovery and Digital Evidence Committee of the ABA Section of Science and Technology Law.]

Although still relatively new, technology-assisted review (TAR) has become a game changer for electronic discovery. This is no surprise. With digital content exploding at unimagined rates, the cost of review has skyrocketed, now accounting for over 70% of discovery costs. In this environment, a process that promises to cut review costs is sure to draw interest, as TAR, indeed, has.

Called by various names—including predictive coding, predictive ranking, and computer-assisted review—TAR has become a central consideration for clients facing large-scale document review. It originally gained favor for use in pre-production reviews, providing a statistical basis to cut review time by half or more. It gained further momentum in 2012, when federal and state courts first recognized the legal validity of the process.

More recently, lawyers have realized that TAR also has great value for purposes other than preparing a production. For one, it can help you quickly find the most relevant documents in productions you receive from an opposing party. TAR can be useful for early case assessment, for regulatory investigations and even in situations where your goal is only to speed up the production process through prioritized review. In each case, TAR has proven to save time and money, often in substantial amounts.

But what about non-English language documents? For TAR to be useful in international litigation, it needs to work for languages other than English. Although English is used widely around the world,[1] it is not the only language you will see if you get involved in multi-national litigation, arbitration or regulatory investigations. Chinese, Japanese and Korean will be common for Asian transactions; German, French, Spanish, Russian, Arabic and Hebrew will be found for matters involving European or Middle Eastern nations. Will TAR work for documents in these languages?

Many industry professionals doubted that TAR would work on non-English documents. They reasoned that the TAR process was about “understanding” the meaning of documents. It followed that unless the system could understand the documents—and presumably computers understand only English—the process wouldn’t be effective.

The doubters were wrong. Computers don’t actually understand documents; they simply catalog the words in documents. More accurately, we call what they recognize “tokens,” because often the fragments (numbers, misspellings, acronyms and simple gibberish) are not even words. The question, then, is whether computers can recognize tokens (words or otherwise) when they appear in other languages.

The simple answer is yes. If the documents are processed properly, TAR can be just as effective for non-English documents as it is for English ones. After a brief introduction to TAR and how it works, I will show you how this can be the case. We will close with a case study using TAR for Japanese documents.

What is TAR?

TAR is a process through which one or more humans interact with a computer to train it to find relevant documents. Just as there are many names for the process, there are many variations of it. For simplicity’s sake, I will use Magistrate Judge Andrew J. Peck’s definition in Da Silva Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012), the first case to approve TAR as a method to shape document review:

By computer assisted review, I mean tools that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with a human reviewer.

It is about as simple as that:

  1. A human (subject matter expert, often a lawyer) sits down at a computer and looks at a subset of documents.
  2. For each, the lawyer records a thumbs-up or thumbs-down decision (tagging the document). The TAR algorithm watches carefully, learning during this training.
  3. When the training session is complete, we let the system rank and divide the full set of documents between (predicted) relevant and irrelevant.[2]
  4. We then review the relevant documents, ignoring the rest.

The benefits from this process are easy to see. Let’s say you started with a million documents that otherwise would have to be reviewed by your team. If the computer algorithm predicted with the requisite degree of confidence that 700,000 are likely not-relevant, you could then exclude them from the review for a huge savings in review costs. That is a great result, particularly if you are the one paying the bills. At four dollars a document for review (to pick a figure), you just saved $2.8 million. And the courts say this is permissible.

How is TAR Used?

TAR can be used for several purposes. The classic use is to prioritize the review process, typically in anticipation of an outgoing production. You use TAR to sort the documents in order of likely relevance. The reviewers do their work in that order, presumably reviewing the most likely relevant ones first. When they get to a point where the number of relevant documents drops significantly, suggesting that they have seen most of them, the review stops. Somebody then samples the unreviewed documents to confirm that the number of relevant documents remaining is sufficiently low to justify discontinuing further, often expensive, review.

We can see the benefits of a TAR process through the following chart, which is known as a yield curve:

[Figure: Yield curve comparing a TAR review to a linear review]

A yield curve presents the results of a ranking process and is a handy way to visualize the difference between two processes. The X axis shows the percentage of the document population reviewed at any given point. The Y axis shows the percentage of relevant documents found at that point in the review.

As a baseline, I created a gray diagonal line to show the progress of a linear review (which essentially moves through the documents in random order). Without a better means for ordering the documents by relevance, the recall rate for a linear review typically matches the percentage of documents actually reviewed; hence the straight line. By the time you have seen 80% of the documents, you probably have seen 80% of the relevant documents.

The blue line shows the progress of a TAR review. Because the documents are ranked in order of likely relevance, you see more relevant documents at the front end of your review. Following the blue line up the Y axis, you can see that you would reach 50% recall (have viewed 50% of the relevant documents) after about 5% of your review. You would have seen 80% of the relevant documents after reviewing just 10% of the total review population.

This is a big deal. If you use TAR to organize your review, you can dramatically improve the speed at which you find relevant documents over a linear review process. Assuming the judge will let you stop your review after you find 80% of the documents (and some courts have indicated this is a valid stopping point), review savings can be substantial.
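To make the yield curve concrete, here is a minimal sketch in Python (using numpy and matplotlib, with synthetic judgments standing in for real review data) that plots cumulative recall against the percentage of documents reviewed, alongside the linear-review baseline:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: one judgment per document, 1 = relevant, 0 = not.
# With real data, these would be reviewer calls listed in ranked order.
rng = np.random.default_rng(42)
scores = rng.random(10_000)                         # predicted relevance scores
labels = (rng.random(10_000) < scores * 0.3).astype(int)
ranked = labels[np.argsort(-scores)]                # best-ranked documents first

pct_reviewed = np.arange(1, len(ranked) + 1) / len(ranked) * 100
recall = np.cumsum(ranked) / ranked.sum() * 100

plt.plot(pct_reviewed, recall, label="TAR (ranked) review")
plt.plot([0, 100], [0, 100], "--", label="Linear review baseline")
plt.xlabel("% of documents reviewed")
plt.ylabel("% of relevant documents found (recall)")
plt.legend()
plt.show()
```

With real review data, the `labels` array is simply the reviewers' relevant/non-relevant calls taken in ranked order; the steeper the curve rises above the diagonal, the greater the savings.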

You can also use this process for other purposes. Analyzing inbound productions is one good example. These are often received shortly before depositions begin. If you receive a million or so documents in a production, how are you to quickly find which ones are important and which are not?

Here is an example where counsel reviewed about 200,000 documents received not long before depositions commenced and found about 5,700 which were “hot.” Using a small set of their own judgments about the documents for training, we were able to demonstrate that they would have found the same number of hot documents after reviewing only 38,000 documents. They could have stopped there and avoided the cost of reviewing the remaining 162,000 documents.

[Figure: Yield curve for the inbound production review]

You can also use this process for early case assessment, using the ranking engine to place a higher number of relevant documents at the front of the stack.

What about non-English Documents?

To understand why TAR can work with non-English documents, you need to know two basic points:

  1. TAR doesn’t understand English or any other language. It uses an algorithm to associate words with relevant or irrelevant documents.
  2. To use the process for non-English documents, particularly those in Chinese and Japanese, the system has to first tokenize the document text so it can identify individual words.

We will hit these topics in order.

1. TAR Doesn’t Understand English

It is beyond the province of this article to provide a detailed explanation of how TAR works, but a basic explanation will suffice for our purposes. Let me start with this: TAR doesn’t understand English or the actual meaning of documents. Rather, it simply analyzes words algorithmically according to their frequency in relevant documents compared to their frequency in irrelevant documents.

Think of it this way. We train the system by marking documents as relevant or irrelevant. When I mark a document relevant, the computer algorithm analyzes the words in that document and ranks them based on frequency, proximity or some other such basis. When I mark a document irrelevant, the algorithm does the same, this time giving the words a negative score. At the end of the training process, the computer sums up the analysis from the individual training documents and uses that information to build a search against a larger set of documents.

While different algorithms work differently, think of the TAR system as creating huge searches using the words developed during training. It might use 10,000 positive terms, each weighted for importance. It might similarly use 10,000 negative terms, weighted in the same way. The search results would come back sorted by those weights, with the most likely relevant documents first.

None of this requires that the computer know English or the meaning of the documents or even the words in them. All the computer needs to know is which words are contained in which documents.
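To make that concrete, here is a minimal sketch in Python with made-up training examples: it gives each token a positive or negative weight based on how often the token appears in relevant versus irrelevant training documents, then scores unreviewed documents by summing those weights. Real TAR engines use far more sophisticated algorithms; the point is only that nothing here requires understanding what the words mean.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()          # crude whitespace tokenization

def train(relevant_docs, irrelevant_docs):
    """Assign each token a weight: positive if it favors relevance."""
    rel, irr = Counter(), Counter()
    for doc in relevant_docs:
        rel.update(set(tokenize(doc)))
    for doc in irrelevant_docs:
        irr.update(set(tokenize(doc)))
    weights = {}
    for token in set(rel) | set(irr):
        # Smoothed log-odds of the token appearing in relevant documents
        p_rel = (rel[token] + 1) / (len(relevant_docs) + 2)
        p_irr = (irr[token] + 1) / (len(irrelevant_docs) + 2)
        weights[token] = math.log(p_rel / p_irr)
    return weights

def score(doc, weights):
    return sum(weights.get(t, 0.0) for t in set(tokenize(doc)))

# Hypothetical training judgments
weights = train(
    relevant_docs=["price fixing meeting agenda", "meeting about pricing"],
    irrelevant_docs=["lunch menu for friday", "fantasy football results"],
)
unreviewed = ["notes from the pricing meeting", "friday lunch order"]
ranked = sorted(unreviewed, key=lambda d: score(d, weights), reverse=True)
print(ranked)   # documents most likely relevant come first
```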

2. If Documents are Properly Tokenized, the TAR Process Will Work.

Tokenization may be an unfamiliar term to you but it is not difficult to understand. When a computer processes documents for search, it pulls out all of the words and places them in a combined index. When you run a search, the computer doesn’t go through all of your documents one by one. Rather, it goes to an ordered index of terms to find out which documents contain which terms. That’s why search works so quickly. Even Google works this way, using huge indexes of words.

As I mentioned, however, the computer doesn’t understand words or even that a word is a word. Rather, for English documents it identifies a word as a series of characters separated by spaces or punctuation marks. Thus, it recognizes the words in this sentence because each has a space (or a comma) before and after it. Because not every group of characters is necessarily an actual “word,” information retrieval scientists call these groupings “tokens,” and they call the act of identifying tokens for the index “tokenization.”

All of these are tokens:

  • Bank
  • door
  • 12345
  • barnyard
  • mixxpelling

And so on. All of these will be kept in a token index for fast search and retrieval.
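Here is a minimal sketch of tokenization and a token (inverted) index in Python. It is a toy illustration, not any particular search engine's implementation: each document is split into tokens, and the index records which documents contain which tokens, so a search never has to scan the documents themselves.

```python
import re
from collections import defaultdict

docs = {
    1: "Wire the funds to the bank account 12345.",
    2: "The barnyard door was left open.",
    3: "A mixxpelling still becomes a token in the index.",
}

# Tokenize: split on anything that is not a letter or digit
def tokenize(text):
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

# Build the inverted index: token -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

print(index["bank"])      # {1}
print(index["12345"])     # {1}
print(index["token"])     # {3}
```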

Certain languages, such as Chinese and Japanese, don’t delineate words with spaces or Western punctuation. Rather, their characters run together continuously, often with no breaks at all. It is up to the reader to segment the sentences into words in order to understand their meaning.

Many early English-language search systems couldn’t tokenize Asian text, resulting in search results that often were less than desirable. More advanced search systems, like the one we chose for Catalyst, had special tokenization engines which were designed to index these Asian languages and many others that don’t follow the Western conventions. They provided more accurate search results than did their less-advanced counterparts.

Similarly, the first TAR systems were focused on English-language documents and could not process Asian text. At Catalyst, we added a text tokenizer to make sure that we handled these languages properly. As a result, our TAR system can analyze Chinese and Japanese documents just as if they were in English. Word frequency counts are just as effective for these documents and the resulting rankings are as effective as well.
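Catalyst's tokenization engine is proprietary, but the idea can be illustrated with janome, an open-source Japanese morphological analyzer (assuming the package is installed). The segmenter breaks an unspaced Japanese sentence into word-level tokens that an index or TAR engine can then count, just as it counts English words.

```python
# pip install janome  (assumed available for this sketch)
from janome.tokenizer import Tokenizer

text = "私は昨日東京で会議に出席しました"   # written with no spaces between words
tokenizer = Tokenizer()

# Segment the text into tokens the indexing/ranking engine can count
tokens = [token.surface for token in tokenizer.tokenize(text)]
print(tokens)
# Roughly: ['私', 'は', '昨日', '東京', 'で', '会議', 'に', '出席', 'し', 'まし', 'た']
```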

A Case Study to Prove the Point.

Let me illustrate this with an example from a matter we handled not long ago. We were contacted by a major U.S. law firm that was facing review of a set of mixed Japanese and English language documents. It wanted to use TAR on the Japanese documents, with the goal of cutting both the cost and time of the review, but was uncertain whether TAR would work with Japanese.

Our solution to this problem was to first tokenize the Japanese documents before beginning the TAR process. Our method of tokenization—also called segmentation—extracts the Japanese text and then uses language-identification software to break it into words and phrases that the TAR engine can identify.

To achieve this, we loaded the Japanese documents into our review platform. As we loaded the documents, we performed language detection and extracted the Japanese text. Then, using our proprietary technology and methods, we tokenized the text so the system would be able to analyze the Japanese words and phrases.

With tokenization complete, we could begin the TAR process. In this case, senior lawyers from the firm reviewed 500 documents to create a reference set to be used by the system for its analysis. Next, they reviewed a sample set of 600 documents, marking them relevant or non-relevant. These documents were then used to train the system so it could distinguish between likely relevant and likely non-relevant documents and use that information for ranking.

After the initial review, and based on the training set, we directed the system to rank the remainder of the documents for relevance. The results were compelling:

  • The system was able to identify a high percentage of likely relevant documents (98%) and place them at the front of the review queue through its ranking process. As a result, the review team would need to review only about half of the total document population (48%) to cover the bulk of the likely relevant documents.
  • The remaining portion of the documents (52%) contained a small percentage of likely relevant documents. The review team reviewed a random sample from this portion and found only 3% were likely relevant. This low percentage suggested that these documents did not need to be reviewed, thus saving the cost of reviewing over half the documents.

By applying tokenization before beginning the TAR process, the law firm was able to target its review toward the most-likely relevant documents and to reduce the total number of documents that needed to be reviewed or translated by more than half.

Conclusion

As corporations grow increasingly global, legal matters are increasingly likely to involve non-English language documents. Many believed that TAR was not up to the task of analyzing non-English documents. The truth, however, is that with the proper technology and expertise, TAR can be used with any language, even difficult Asian languages such as Chinese and Japanese.

Whether for English or non-English documents, the benefits of TAR are the same. By using computer algorithms to rank documents by relevance, lawyers can review the most important documents first, review far fewer documents overall, and ultimately cut both the cost and time of review. In the end, that is something their clients will understand, no matter what language they speak.

 


[1] It is, for example, the language used in almost every commercial deal involving more than one country.

[2] Relevant in this case means relevant to the issues under review. TAR systems are often used to find responsive documents but they can be used for other inquiries such as privileged, hot or relevant to a particular issue.

Predictive Ranking (TAR) for Smart People

Predictive Ranking, aka predictive coding or technology-assisted review, has revolutionized electronic discovery–at least in mindshare if not actual use. It now dominates the dais for discovery programs, and has since 2012 when the first judicial decisions approving the process came out. Its promise of dramatically reduced review costs is top of mind today for general counsel. For review companies, the worry is about declining business once these concepts really take hold.

While there are several “Predictive Coding for Dummies” books on the market, I still see a lot of confusion among my colleagues about how this process works. To be sure, the mathematics are complicated, but the techniques and workflow are not that difficult to understand. I write this article with the hope of clarifying some of the more basic questions about TAR methodologies.

I spent over 20 years as a trial lawyer and partner at a national law firm and another 15 at Catalyst. During that time, I met a lot of smart people–but few actual dummies. This article is for smart lawyers and legal professionals who want to learn more about TAR. Of course, you dummies are welcome to read it too.

What is Predictive Ranking?

Predictive Ranking is our name for an interactive process whereby humans train a computer algorithm to identify useful (relevant) documents. We call it Predictive Ranking because the goal of these systems is to rank documents in order of estimated relevance. Humans do the actual coding.

How does it work?

In its simplest form, it works like the Pandora Internet radio service. Pandora has thousands of songs in its archive but no idea what kind of music you like. Its goal is to play music from your favorite artists but also to present new songs you might like as well.

[Image: Pandora]

How does Pandora do this? For those who haven’t tried it, you start by giving Pandora the name of one or more artists you like, thus creating a “station.” Pandora begins by playing a song or two by the artists you have selected. Then, it chooses a similar song or artist you didn’t select to see if you like it. You answer by clicking a “thumbs up” or “thumbs down” button. Information retrieval (IR) scientists call this “relevance feedback.”

Pandora analyzes the songs you like, as well as the songs you don’t, to make its suggestions. It looks at factors such as melody, harmony, rhythm, form, composition and lyrics to find similar songs. As you give it feedback on its suggestions, it takes that information into account in order to make better selections the next time. The IR people would call this “training.”

The process continues as you listen to your radio station. The more feedback you provide, the smarter the system gets. The end result is Pandora plays a lot of music you like and, occasionally, something you don’t like.

Predictive Ranking works in a similar way–only you work with documents rather than songs. As you train the system, it gets smarter about which documents are relevant to your inquiry and which are not.[1] It is as simple as that.

OK, but how does Predictive Ranking really work?

Well, it really is just like Pandora, although there are a few more options and strategies to consider. Also, different vendors approach the process in different ways, which can cause some confusion. But here is a start toward explaining the process.

1. Collect the documents you want to review and feed them to the computer.

To start, the computer has to analyze the documents you want to review (or not review), just like Pandora needs to analyze all the music it maintains. While approaches vary, most systems analyze the words in your documents in terms of frequency in the document and across the population.

Some systems require that you collect all of the documents before you begin training. Others, like our system, allow you to add documents during the training process. Either approach works. It is just a matter of convenience.
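As a rough illustration of that word analysis (an assumption about tooling, not a description of any vendor's system), scikit-learn's TfidfVectorizer weights each word by its frequency within a document and discounts words that are common across the whole population:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical collection fed to the system before training begins
documents = [
    "quarterly pricing committee meeting notes",
    "pricing discussion with competitor sales team",
    "office holiday party planning",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)   # one weighted word vector per document

print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))               # rarer words get higher weights
```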

2. Start training/review.

You have two choices here. You can start by presenting documents you know are relevant (or non-relevant) to the computer or you can let the computer select documents for your consideration. With Pandora, you typically start by identifying an artist you like. This gives the computer a head start on your preferences. In theory, you could let Pandora select music randomly to see if you liked it but this would be pretty inefficient.

Either way, you essentially begin by giving the computer examples of which documents you like (relevant) and which you don’t like (non-relevant).[2] The system learns from the examples which terms tend to occur in relevant documents and which in non-relevant ones. It then develops a mathematical formula to help it predict the relevance of other documents in the population.

There is an ongoing debate about whether the training examples must be provided by subject matter experts (SMEs) to be effective. Our research suggests that review teams assisted by SMEs are just as effective as SMEs alone. See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? Others disagree. See, for example, Ralph Losey’s posts about the need for SMEs to make the process effective.

[Image: Insight Predict]

3. Rank the documents by relevance.

This is the heart of the process. Based on the training you have provided, the system creates a formula which it uses to rank (order) your documents by estimated relevance.

4. Continue training/review (rinse and repeat).

Continue training using your SME or review team. Many systems will suggest additional documents for training, which will help the algorithm get better at understanding your document population. For the most part, the more training/review you do, the better the system will be at ranking the unseen documents.

5. Test the ranking.

How good a job did the system do on the ranking? If the ranking is “good enough,” move forward and finish your review. If it is not, continue your training.

Some systems view training as a process separate from review. Following this approach, your SMEs would handle the training until they were satisfied that the algorithm was fully trained. They would then let the review teams look at the higher-ranked documents, possibly discarding those below a certain threshold as non-relevant.

Our research suggests that a continuous learning process is more effective. We therefore recommend that you feed reviewer judgments back to the system for a process of continuous learning. As a result, the algorithm continues to get smarter, which can mean even fewer documents need to be reviewed. See: TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?
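To show the shape of that workflow (and only the shape; our actual algorithms are not shown here), here is a minimal continuous-learning sketch in Python using scikit-learn's logistic regression as a stand-in ranking engine. After each batch of reviewer judgments, the model is retrained and the remaining documents are re-ranked:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_review(docs, get_reviewer_judgment, seed_labels, batch_size=50):
    """Rank, review a batch, feed the judgments back, re-rank; repeat.

    seed_labels: dict of doc index -> 0/1, and it must contain at least one
    relevant and one non-relevant example to train the first model.
    """
    X = TfidfVectorizer().fit_transform(docs)
    labeled = dict(seed_labels)

    # In practice you stop once enough relevant documents have been found;
    # this sketch simply runs until everything has been judged.
    while len(labeled) < len(docs):
        model = LogisticRegression(max_iter=1000)
        idx = list(labeled)
        model.fit(X[idx], [labeled[i] for i in idx])

        # Rank the unreviewed documents by predicted relevance
        unreviewed = [i for i in range(len(docs)) if i not in labeled]
        probs = model.predict_proba(X[unreviewed])[:, 1]
        ranked = [i for _, i in sorted(zip(probs, unreviewed), reverse=True)]

        # Reviewers judge the next batch from the top of the ranking;
        # those judgments become additional training examples.
        for i in ranked[:batch_size]:
            labeled[i] = get_reviewer_judgment(i)

    return labeled
```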

6. Finish the review.

The end goal is to finish the review as efficiently and cost-effectively as possible. In a linear review, you typically review all of the documents in the population. In a predictive review, you can stop well before then because the important documents have been moved to the front of the queue. You save on both review costs and the time it takes to complete the review.

Ultimately, “finishing” means reviewing down the ranking until you have found enough relevant documents, with the concept of proportionality taking center stage. Thus, you stop after reviewing the first 20% of the ranking because you have found 80% of the relevant documents. Your argument is that the cost to review the remaining 80% of the document population just to find the remaining 20% of the relevant documents is unduly burdensome.[3]

That’s all there is to it. While there are innumerable choices in applying the process to a real case, the rest is just strategy and execution.

How do I know if the process is successful?

That, of course, is the million-dollar question. Fortunately, the answer is relatively easy.

The process succeeds to the extent that the document ranking places more relevant documents at the front of the pack than you might get when the documents are ordered by other means (e.g. by date or Bates number). How successful you are depends on the degree to which the Predictive Ranking is better than what you might get using your traditional approach.

Let me offer an example. Imagine your documents are represented by a series of cells, as in the below diagram. The orange cells represent relevant documents and the white cells non-relevant.

[Figure: Random distribution of relevant (orange) and non-relevant (white) documents]

What we have is essentially a random distribution; at least there is no discernible pattern to where the relevant documents fall. In that regard, this might be similar to a review where you ordered documents by Bates number or date. In most cases, there is no reason to expect that relevant documents would appear at the front of the order.

This is typical of a linear review. If you review 10% of the documents, you likely will find 10% of the relevant documents. If you review 50%, you will likely find 50% of the relevant documents.

Take a look at this next diagram. It represents the outcome of a perfect ordering. The relevant documents come first followed by non-relevant documents.

[Figure: Perfect ordering, with all relevant documents at the front]

If you could be confident that the ranking worked perfectly, as in this example, it is easy to see the benefit of ordering by rank. Rather than review all of the documents to find relevant ones, you could simply review the first 20% and be done. You could confidently ignore the remaining 80% (perhaps after sampling them) or, at least, direct them to a lower-priced review team.

Yes, but what is the ranking really like?

Since this is directed at smart people, I am sure you realize that computer rankings are never that good. At the same time, they are rarely (if ever) as bad as you might see in a linear review.

Following our earlier examples, here is how the actual ranking might look using Predictive Ranking:

[Figure: Actual ranking, with relevant documents concentrated toward the front]

We see that the algorithm certainly improved on the random distribution, although it is far from perfect. We have 30% of the relevant documents at the top of the order, followed by an increasing mix of non-relevant documents. At about a third of the way into the review, you would start to run out of relevant documents.

This would be a success by almost any measure. If you stopped your review at the midway point, you would have seen all but one relevant document. By cutting out half the document population, you would save substantially on review costs.

How do I measure success?

If the goal of Predictive Ranking is to arrange a set of documents in order of likely relevance to a particular issue, the measure of success is the extent to which you meet that goal. Put as a question: “Am I getting more relevant documents at the start of my review than I would with my typical approach (often a linear review)?”[4] If the answer is yes, then how much better?

To answer these questions, we need to take two additional steps. First, for comparison purposes, we will want to measure the “richness” of the overall document population. Second, we need to determine how effective our ranking system turned out to be against the entire document population.

1. Estimating richness: Richness is a measure of how many relevant documents are in your total document population. Some people call this “prevalence,” a reference to how prevalent relevant documents are in the total population. For example, we might estimate that 15% of the documents are relevant, with 85% non-relevant. Or we might say document prevalence is 15%.

How do we estimate richness? Once the documents are assembled, we can use random sampling for this purpose. In general, a random sample allows us to look at a small subset of the document population, and make predictions about the nature of the larger set.[5] Thus, from the example above, if our sample found 15 documents out of a hundred to be relevant, we would project a richness of 15%. Extrapolating that to the larger population (100,000 for example), we might estimate that there were about 15,000 relevant documents to be found.

For those really smart people who understand statistics, I am skipping a discussion about confidence intervals and margins of error. Let me just say that the larger the sample size, the more confident you can be in your estimate. But, surprisingly, the sample size does not have to be that large to provide a high degree of confidence.
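For those who want to see the arithmetic, here is a minimal sketch of a richness estimate with a standard normal-approximation confidence interval (the sample counts below are hypothetical):

```python
import math

sample_size = 400        # documents randomly sampled and reviewed
relevant_in_sample = 60  # of those, judged relevant
population = 100_000

p = relevant_in_sample / sample_size            # estimated richness: 0.15
z = 1.96                                        # 95% confidence level
margin = z * math.sqrt(p * (1 - p) / sample_size)

print(f"Richness: {p:.1%} +/- {margin:.1%}")
print(f"Estimated relevant documents: {int(p * population):,}")
# Richness: 15.0% +/- 3.5%  ->  roughly 15,000 relevant documents
```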

2. Evaluating the ranking: Once the documents are ranked, we can then sample the ranking to determine how well our algorithm did in pushing relevant documents to the top of the stack. We do this through a systematic random sample.

In a systematic random sample, we sample the documents in their ranked order, tagging them as relevant or non-relevant as we go. Specifically, we sample every Nth document from the top to the bottom of the ranking (e.g. every 100th document). Using this method helps ensure that we are looking at documents across the ranking spectrum, from highest to lowest.

As an aside, you can actually use a systematic random sample to determine overall richness/prevalence and to evaluate the ranking. Unless you need an initial richness estimate, say for review planning purposes, we recommend you do both steps at the same time.

You can read more about simple and systematic random sampling in an earlier article I wrote, Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?
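In code, a systematic random sample is simply a matter of stepping through the ranked list at a fixed interval. Here is a minimal sketch with toy data; the review function is a stand-in for a human judgment:

```python
def systematic_sample(ranked_docs, review, step=100):
    """Judge every Nth document, from the top to the bottom of the ranking.

    ranked_docs: documents ordered from highest to lowest predicted relevance
    review:      callable returning True if a human judges the document relevant
    (A production sample would usually start at a random offset within the
    first interval; that detail is omitted here.)
    """
    sample = ranked_docs[::step]
    return [(i * step, review(doc)) for i, doc in enumerate(sample)]

# Toy example: pretend documents containing "price" are the relevant ones
docs = [f"doc {n} {'price' if n % 7 == 0 else 'misc'}" for n in range(10_000)]
judgments = systematic_sample(docs, review=lambda d: "price" in d)
print(judgments[:5])   # [(0, True), (100, False), (200, False), ...]
```

The resulting (rank position, judgment) pairs are what get plotted as the yield curve in the next section.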

Comparing the results

We can compare the results of the systematic random sample to the richness of our population by plotting what scientists call a “yield curve.” While this may sound daunting, it is really rather simple. It is the one diagram you should know about if you are going to use Predictive Ranking.

[Figure: Yield curve for a linear review]

A yield curve can be used to show the progress of a review and the results it yields, at least in the number of relevant documents found. The X axis shows the percentage of documents reviewed (or to be reviewed). The Y axis shows the percentage of relevant documents found (or that you would expect to find) at any given point in the review.

Linear review: Knowing that the document population is 15% rich (give or take) provides a useful baseline against which we can measure the success of our Predictive Ranking effort. We plot richness as a diagonal line going from zero to 100%. It reflects the fact that, in a linear review, we expect the percentage of relevant documents to correlate to the percentage of total documents reviewed.

Following that notion, we can estimate that if the team were to review 10% of the document population, they would likely see 10% of the relevant documents. If they were to look at 50% of the documents, we would expect them to find 50% of the relevant documents, give or take. If they wanted to find 80% of the relevant documents, they would have to look at 80% of the entire population.

Predictive Review: Now let’s plot the results of our systematic random sample. The purpose is to show how the review might progress if we reviewed documents in a ranked order, from likely relevant to likely non-relevant. We can easily compare it to a linear review to measure the success of the Predictive Ranking process.

[Figure: Yield curve for a predictive review compared to a linear review]

You can quickly see that the line for the Predictive Review goes up more steeply than the one for linear review. This reflects the fact that in a Predictive Review the team starts with the most likely relevant documents. The line continues to rise until you hit the 80% relevant mark, which happens after a review of about 10-12% of the entire document population. The slope then flattens, particularly as you cross the 90% relevant line. That reflects the fact that you won’t find as many relevant documents from that point onward. Put another way, you will have to look through a lot more documents before you find your next relevant one.

We now have what we need to measure the success of our Predictive Ranking project. To recap, we needed:

  1. A richness estimate so we have an idea of how many relevant documents are in the population.
  2. A systematic random sample so we can estimate how many relevant documents got pushed to the front of the ordering.

It is now relatively easy to quantify success. As the yield curve illustrates, if I engage in a Predictive Review, I will find about 80% of the relevant documents after only reviewing about 12% of total documents. If I wanted to review 90% of the relevant documents, I could stop after reviewing just over 20% of the population. My measure of success would be the savings achieved over a linear review.[6]
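Here is a minimal sketch of that quantification step: given the systematic-sample judgments in ranked order, estimate how far down the ranking you would have to review to reach a target recall, and compare that to a linear review. (The data plugged in would come from your own sample.)

```python
def cutoff_for_recall(judgments, target_recall=0.80):
    """Estimate the fraction of the ranked population you must review to
    reach the target recall, based on systematic-sample judgments listed
    in ranked order as (rank_position, is_relevant) pairs."""
    total_relevant = sum(1 for _, relevant in judgments if relevant)
    found = 0
    for i, (_, relevant) in enumerate(judgments, start=1):
        found += relevant
        if found >= target_recall * total_relevant:
            return i / len(judgments)
    return 1.0

# e.g., with the `judgments` list from the systematic sample above:
# depth = cutoff_for_recall(judgments, target_recall=0.80)
# print(f"Review the top {depth:.0%} of the ranking for ~80% recall "
#       f"(a linear review would need ~80% of the documents)")
```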

At this point we move into proportionality arguments. What is the right stopping point for our case? The answer depends on the needs of your case, the nature of the documents and any stipulated protocols among the parties. At the least, the yield curve helps you frame the argument in a meaningful way.

Moving to the advanced class

My next post will take this discussion to a higher level, talking about some of the advanced questions that dog our industry. For a sneak peek at my thinking, take a look at a few of the articles we have already posted on the results of our research. I think you now have a foundation upon which to understand these and just about any other article on the topic you might find.

I hope this was helpful. Post your questions below. I will try and answer them (or pass them on to our advisory board for their thoughts).

Further reading:


[1] IR specialists call these documents “relevant” but they do not mean relevant in a legal sense. They mean important to your inquiry even though you may not plan on introducing them at trial. You could substitute hot, responsive, privileged or some other criterion depending on the nature of your review.

[2] I could use “irrelevant” but that has a different shade of meaning for the IR people so I bow to their use of non-relevant here. Either word works for this discussion.

[3] Sometimes at the meet-and-confer, the parties agree on Predictive Ranking protocols, including the relevance score that will serve as the cut-off for review.

[4] I will use a linear review (essentially a random relevance ordering) as a baseline because that is the way most reviews are done. If you review based on conceptual clusters or some other method, your baseline for comparison would be different.

[5] Note that an estimate based on a random sample is not valid unless you are sampling against the entire population. If you get new documents, you have to redo your sample.

[6] In a separate post we will argue that the true measure of success with Predictive Ranking is the total amount saved on the review, taking into consideration software and hardware along with human costs. Time savings is also an important factor. IR scientist William Webber has touched on this point here: Total annotation cost should guide automated review.

The Five Myths of Technology Assisted Review, Revisited

On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process and generated a lot of interest and a bit of dyspepsia too. At the least, it got some good discussions going and perhaps nudged the status quo a bit in the balance.

One writer, Roe Frazer, took issue with our views in a blog post he wrote. Apparently, he tried to post his comments with Law Technology News but was unsuccessful. Instead, he posted his reaction on the blog of his company, Cicayda. We would have responded there but we don’t see a spot for replies on that blog either.

We love comments like these and the discussion that follows. This post offers our thoughts on the points raised by Mr. Frazer and we welcome replies right here for anyone interested in adding to the debate. TAR 1.0 is a challenging-enough topic to understand. When you start pushing the limits into TAR 2.0, it gets really interesting. In any event, you can’t move the industry forward without spirited debate. The more the merrier.

We will do our best to summarize Mr. Frazer’s comments and offer our responses.

1. Only One Bite at the Apple?

Mr. Frazer suggests we were “just a bit off target” on the nature of our criticism. He rightly points out that litigation is an iterative (“circular” he calls it) business.

When new information comes into a case through initial discovery, with TAR/PC you must go back and re-train the system. If a new claim or new party gets added, then a document previously coded one way may have a completely different meaning and level of importance in light of the way the data facts changed. This is even more so the case with testimony, new rounds of productions, non-party documents, heck even social media, or public databases. If this happens multiple times, you wind up reviewing a ton of documents to have any confidence in the system. Results are suspect at best. Cost savings are gone. Time is wasted. Attorneys, entrusted with actually litigating the case, do not and should not trust it, and thus smartly review even more documents on their own at high pay rates. I fail to see the value of “continuous learning”, or why this is better. It cannot be.

He might be missing our point here. Certainly he is correct when he says that more training is always needed when new issues arise, or when new documents are added to the collection. And there are different ways of doing that additional training, some of which are smarter than others. But that is the purview of Myth #4, so we’ll address it below. Let us, therefore, clarify that when we’re talking about “only one bite of the apple,” we’re talking about what happens when the collection is static and no new issues are added.

To give a little background, let us explain what we understand to be the current, gold standard TAR workflow, to which we are reacting. What we see the industry in general saying is that the way TAR works is that you get ahold of the most senior, experienced, expertise-laden individual that you can, and then you sit that person down in front of an active learning TAR training (learning) algorithm and have the person iteratively judge thousands of documents until the system “stabilizes.” Then you apply the results of that learning to your entire collection and batch out the top documents to your contract review team for final proofing. At the point you do that batching, says the industry, learning is complete, finito, over, done. Even if you trust your contract review team to judge batched-out documents, none of those judgments are ever fed back into the system, to be used for further training to improve the ranking from the algorithm.

Myth #1 says that it doesn’t have to be that way. What “continuous learning” means is that all judgments during the review should get fed back into the core algorithm to improve the quality with regard to any and all documents that have not yet received human attention. And the reason why it is better? Empirically, we’ve seen it to be better. We’ve done experiments in which we’ve trained an algorithm to “stability,” and then we’ve continued training even during the batched-out review phase – and seen that the total number of documents that need to be examined until a defensible threshold is hit continues to go down. Is there value in being able to save even more on review costs? We think that there is.

You can see some of the results of our testing on the benefits of continuous learning here.

2. Are Subject Matter Experts Required?

We understand that this is a controversial issue and that it will take time before people become comfortable with this new approach. To quote Mr. Frazer:

To the contrary, using a subject matter expert is critical to the success of litigation – that is a big reason AmLaw 200 firms get hired. Critical thinking and strategy by a human lawyer is essential to a well-designed discovery plan. The expertise leads to better training decisions and better defensibility at the outset. I thus find your discussion of human fallibility and review teams puzzling.

Document review is mind numbing and people are inconsistent in tagging which is one of the reasons for having the expert in the first place. With a subject matter expert, you are limiting the amount of fallible humans in the process. We have seen many “review lawyers” and we have yet to find one who does not need direction by a subject matter expert. One of the marketing justifications for using TAR/PC is that human review teams are average to poor at finding relevant documents – it must be worse without a subject matter expert. I do agree with your statement that “most senior attorneys… feel they have better things to do than TAR training in any event.” With this truth, you have recognized the problem with the whole system: Spend $100k+ on a review process, eat up a large portion of the client’s litigation budget, yet the expert litigation team who they hired has not looked at a single document, while review attorneys have been “training” the system? Not relying on an expert seems to contradict your point  3, ergo.

Again, the nature of this response indicates that you are approaching this from the standard TAR workflow, which is to have your most senior expert sit for a number of days and train to stability, and then never have the machine learn anything again. To dispel the notion that this workflow is the only way in which TAR can or even should work is one reason we’re introducing these myths in the first place. What we are saying in our Myth #2 is not that you would never have senior attorneys or subject matter experts involved in any way. Of course that person should train the contract reviewers.  Rather, we are saying that you can involve non-experts, non-senior attorneys in the actual training of the system and achieve results that are just as good as having *only* a senior attorney sit and train the system.  And our method dramatically lowers both your total time cost and your total monetary cost in the process.

For example, imagine a workflow in which your contract reviewers, rather than your senior attorney, do all the initial training on those thousands of documents. Then, at some point later in the process, your senior attorney steps in and re-judges a small fraction of the existing training documents. He or she corrects via the assistance of a smart algorithm only the most egregious, inconsistent training outliers and then resubmits for a final ranking. We’ve tested this workflow empirically, and found that it yields results that are just as good, if not better, than the senior attorney working alone, training every single document. (See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?)

Moreover, you can get through training more quickly, because now you have a team working in parallel, rather than an individual working serially. Add to that the fact that your senior attorney does not always have free time the moment that training needs to get done, the flexibility to bring that senior attorney in at a later point, and do a tenth of the work that he or she would otherwise have to do, and you have a recipe for success. That’s what this myth is about – the notion that the rest of the industry has, and your own response indicates, that unless a senior attorney does every action that in any way affects the training of the system, it is a recipe for disaster. It is not; that is a myth.

And again, we justify this not through appeals to authority (“that is a big reason AmLaw 200 firms get hired”), but through empirical methods. We’ve tested it out extensively. But if appeals to authority are what is needed to show that the algorithms we employ are capable of successfully supporting these alternative workflows, we can do so. Our in-house senior research scientist, Jeremy Pickens, has his PhD from one of the top two information retrieval research labs in the country, and not only holds numerous patents on the topic, but has received the best paper award at the top information retrieval conference in the world (ACM SIGIR). Blah blah blah. But we’d prefer not to have to resort to appeals to authority, because empirical validation is so much more effective.

Please note also that we in no way *force* you to use non-senior attorneys during the training process. You are of course free to work however you want to work. However, should time or money be an issue, we’ve designed our system to allow you to work successfully and more efficiently without requiring senior attorneys or experts to do all of your training.

You can see the results of our research on the use of subject matter experts here and here.

3. Must I Train on Randomly Selected Documents?

We pointed out in our article that it is a myth that TAR training can only be on random documents.

You totally glossed over bias. Every true scientific study says that judgmental sampling is fraught with bias. Research into sampling in other disciplines is that results from judgmental sampling should be accepted with extreme caution at best. It is probably even worse in litigation where “winning” is the driving force and speed is omnipresent. Yet, btw, those who advocate judgmental sampling in eDiscovery, Grossman, e.g., also advocate that the subject matter experts select the documents – this contradicts your points in 2. You make a true point about the richness of the population making it difficult to find documents, but this militates against random selection, not for it. To us this shows another reason why TAR/PC is broken. Indeed “clicking through thousands of random documents is boring” – but this begs the question. It was never fun reviewing a warehouse of banker’s documents either. But it is real darn fun when you find the one hidden document that ties everything together, and wins your case. What is boring or not fun has nothing to do with the quality of results in a civil case or criminal investigation.

I hope we have managed to clarify that Myth #2 is not actually saying that you never have to involve a senior attorney in any way, shape or form. Rather we believe that a senior attorney doesn’t have to do every single piece of the TAR training, in all forms, at all times. Once you understand this, you quickly realize that there is no contradiction between what Maura Grossman is saying and what we are saying.

If you want to do judgmental sampling, let your senior attorney and all of his or her wisdom be employed in the process of creating the search queries used to find interesting documents. But instead of requiring that senior person to then look at every single result of those search queries, let your contract reviewers comb through those. In that manner, you can involve your senior attorney where his or her skills are the most valuable and where his or her time is the most precious. It takes a lot less time to issue a few queries than it does to sit and judge thousands of documents. Are we the only vendor out there aware of the notion that the person who issues searches for the documents and who judges all the found documents doesn’t have to be the same person? We would hope not, but perhaps we are.

Now, to the issue of bias. You’re quite right to be concerned about this, and we fault the necessary brevity of our original article in not being able to go into enough detail to satisfy your valid concerns. So we would recommend reading the following article, as it goes into much more depth about how bias is overcome when you start judgmentally, and it backs up its explanations empirically: Predictive Ranking: Technology-Assisted Review Designed for the Real World.

Imagine your TAR algorithm as a seesaw. That seesaw has to be balanced, right? So you have many in the industry saying that the only way to balance it is to randomly select documents along the length of that seesaw. In that manner, you’ll approximately have the same number of docs, at the same distance from the center, on both sides of the seesaw. And the seesaw will therefore be balanced. Judgmental sampling, on the other hand, is like plopping someone down on the far end of the seesaw. That entire side sinks down, and raises the other side high into the air, throwing off the balance. Well, in that case, the best way to balance the seesaw again is to explicitly plop down another equal weight on the exact opposite end of the seesaw, bringing the entire system to equilibrium.

What we’ve designed in the Catalyst system is an algorithm that we call “contextual diversity.” “Contextual” refers to where things have already been plopped down on that seesaw. The “diversity” means “that area of the collection that is most about the things that you know that you know nothing about,” i.e. that exact opposite end of the seesaw, rather than some arbitrary, random point. Catalyst’s contextual diversity algorithm explicitly finds and models those balancing points, and surfaces those to your human judge(s) for coding. In this manner, you can both start judgmentally *and* overcome bias. We apologize that this was not as clear in the original 5 Myths article, but we hope that this explanation helps.

We go into this subject in more detail here.
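Catalyst's contextual diversity algorithm is proprietary, but the general idea of steering training toward the parts of the collection least like anything already judged can be sketched with a generic farthest-first selection over TF-IDF vectors. This is only an illustration of diversity sampling under that seesaw intuition, not our actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def diversity_picks(docs, judged_indices, n_picks=5):
    """Pick the documents least similar to anything already judged
    (generic farthest-first selection; judged_indices must be non-empty)."""
    X = TfidfVectorizer().fit_transform(docs)
    covered = list(judged_indices)
    picks = []
    for _ in range(n_picks):
        # Distance from every document to its nearest already-covered document
        dist = cosine_distances(X, X[covered]).min(axis=1)
        dist[covered] = -1.0              # never re-pick covered documents
        choice = int(dist.argmax())       # the doc most unlike anything seen so far
        picks.append(choice)
        covered.append(choice)            # treat it as covered from now on
    return picks
```

Documents surfaced this way go to human reviewers, and their judgments are what rebalance the training set, whether the imbalance came from judgmental seeding or from a rolling upload.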

4. You Can’t Start Training until You Have All of Your Documents

One of the toughest issues in TAR systems is the requirement that you collect all of your documents before you start TAR training. This limitation stems from the use of a randomly selected control set to both guide training and provide defensibility. If you add new documents to the mix (rolling uploads), they will not be represented in the control set. Thus even if you continue training with some of these new documents, your control set would be invalid and you lose defensibility.

You might have missed that point in your comments:

I think this is similar to #1 in that you are not recognizing the true criticism that things change too much in litigation. While you can start training whenever you want and there are algorithms that will allow you to conduct new rounds on top of old rounds – the real problem is that you must go back and change previous coding decisions because the nature of the case has changed. To me, this is more akin to “continuous nonproductivity” than “continuous learning.”

The way in which we natively handle rolling uploads from a defensibility standpoint is to not rely on a human-judged control set. There are other intelligent metrics we use to monitor the progress of training, so we do not abandon the need for reference, or our defensibility, altogether – just the need for expensive, human-judged reference.

The way other systems have to work, in order to keep their control set valid, is to judge another statistically valid sample of documents from the newly arrived set. And in our experience, in the cases we’ve dealt with over the past five years, there have been on average around 67 separate uploads until the collection was complete. Let’s be conservative and assume you’re dealing with a third of that – say only 20 separate uploads from start to finish. As each new upload arrives, you’re judging 500 randomly selected documents just to create a control set. 500 * 20 = 10,000. And let’s suppose your senior attorney gets through 50 documents an hour. That’s 200 hours of work just to create a defensible control set, with not even a single training document yet judged.  And since you’ve already stated that you need to hire an AmLaw 200 senior attorney to judge these documents, at $400/hour that would be $80,000. Our approach saves you that money right off the bat by being able to natively handle the control set/rolling upload issue. Plug in your own numbers if you don’t like these, but our guess is that it’ll still add up to a significant savings.

But the control set is only half of the story. The other half is the training itself. Let us distinguish if we may between an issue that changes, and a collection that changes. If it is your issue itself (i.e. your definition of responsiveness) that changes when new documents are collected, then certainly nothing we’ve explicitly said in these Five Myths will address that problem. However, if all you are talking about is the changing expression of an unchanging issue, then we are good to go.

What do we mean by the changing expression of an unchanging issue? We mean that if you’ve collected from your engineering custodians first, and started to train the system on those documents, and then suddenly a bunch of marketing custodians arrive, that doesn’t actually change the issue that you’re looking for. What responsiveness was about before is still what responsiveness is about now. However, how that issue is expressed will change. The language that the marketers use is very different than the language that the engineers use, even if they’re talking about the same responsive “aboutness.”

This is exactly why training is a problem for the standard TAR 1.0 workflow. If you’re working in a way that requires your expert to judge all the documents up front, then if the collection grows (by adding the marketing documents to the engineering collection), that expert’s work is not really applicable to the new information and you have to go back to the drawing board, selecting another set of random documents so as to avoid bias, feed those yet again to a busy, time-pressed expert, etc. That is extremely inefficient.

What we do with our continuous learning is once again employ that “contextual diversity” algorithm that we mentioned above. Let us return to the seesaw analogy. Imagine that you’ve got your seesaw, and through the training that you’ve done it is now completely balanced. Now, a new subset of (marketing) documents appears; that is like adding a third plank to the original seesaw. Clearly what happens is that now things are unbalanced again. The two existing planks sink down to the ground, and that third plank shoots up into the air. So how do we solve for that imbalance, without wasting the effort that has gone into understanding the first two planks? Again, we use our contextual diversity algorithm to find the most effective balance point, in the most efficient, direct (aka non-random) manner possible.

Contextual diversity cares neither why nor how the training over a collection of documents is imbalanced. It simply detects the points that, once pressure is applied to them, most effectively rebalance the system. It does not matter whether the seesaw started with two planks and then suddenly grew a third via rolling uploads, or whether the seesaw started with three planks and someone’s judgmental sampling only hit two of them. In both cases there is imbalance, and in both cases the algorithm explicitly models and corrects for that imbalance.

You can read more about this topic here.

5. TAR Does Not Work for Non-English Documents

Many people have now realized that, properly done, TAR can work for other languages including the challenging CJK (Chinese, Japanese and Korean) languages. As we explained in the article, TAR is a “mathematical process that ranks documents based on word frequency. It has no idea what the words mean.”

Mr. Frazer seems to agree but is pitching a different kind of system for TAR:

Words are the weapons of lawyers so why in the world would you use a technology that does not know what they mean? TAR & PC are, IMHO, roads of diversion (perhaps destruction in an outlier case) for the true litigator. They are born out of the need to reduce data, rather than know what is in a large data set. They ignore a far better system is one that empowers the subject matter experts, the true litigators, and even the review team to use their intelligence, experience, and unique skills to find what they need, quickly and efficiently, regardless of how big the data is. A system is needed to understand the words in documents, and read them as a human being would.

There are a lot of points we could make in response to this, but this post is lengthy enough as it is. So let us briefly make just two. The first is that we think Natural Language Processing techniques (which your company apparently uses) are great. There is a lot of value there. And frankly, we think that NLP techniques complement, rather than oppose, the more purely statistical techniques.

That said, our second point is simply to note that in some of the cases that we’ve dealt with here at Catalyst, we have datasets in which over 85% of the documents are computer source code. Where there is no natural language, there can be no NLP. And yet TAR still has to be able to handle those documents as well. So perhaps we should extend Myth #5 to say that it’s a myth that “TAR Does Not Work for Non-Human Language Documents.”

Conclusion

In writing the Five Myths of TAR, our point wasn’t to claim that Catalyst has the only way to address the practical limitations of early TAR systems. To the contrary, there are many approaches to technology-assisted review that a prospective user will want to consider, and some are more cost- and time-effective than others. Rather, our goal was to dispel certain myths that limit the utility of TAR and to let people know that there are practical answers to early TAR limitations. Debating which of those answers works best should be the subject of many of these discussions. We enjoy the debate and try to learn from others as we go along.

How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question

For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes.  So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases. The sites themselves were not chosen for any particular reason, although we looked for a minimum of 10,000 native files on each. We also chose not to use several larger sites where clients used text files as substitutes for the original natives.

Our focus for the study was on standard office files, such as Word, Excel, PowerPoint, PDFs and email. These are generally the focus of most review and discovery efforts and seem most important to our inquiry. I will discuss several other file types a bit later in this report.

I should also note that the files used in our study had already been processed and were loaded into Catalyst Insight, our discovery repository. Thus, they had been de-NISTed, de-duped (or not, depending on client requests), culled, reduced, etc. My point was not to exclude any particular part of the document population; rather, those kinds of files rarely make it past processing and are typically not included in a review.

That said, here is a summary of what we found when we focused on the office and standard email files.

[Table: Office file summary]

The weighted average for these files comes out to 3,124 docs per gigabyte. Not surprisingly, there are wide variations in the counts for different types of files. You can see these more easily when I chart the data.

[Chart: Office file docs per gigabyte]
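
For readers who want to check the math behind a weighted average like this, here is a minimal Python sketch. The counts and sizes in it are illustrative placeholders only; the real inputs are in the summary table above.

# Weighted average documents per gigabyte across file types.
# The counts and sizes below are illustrative placeholders, not the study data.

file_types = {
    # type: (document count, total size in gigabytes)
    "Word":  (1_000_000, 300.0),
    "Excel": (500_000, 450.0),
    "PDF":   (800_000, 250.0),
    "Email": (5_000_000, 1_200.0),
}

total_docs = sum(count for count, _ in file_types.values())
total_gb = sum(gb for _, gb in file_types.values())

for name, (count, gb) in file_types.items():
    print(f"{name:<6} {count / gb:>8,.0f} docs/GB")

# The weighted average is total documents over total gigabytes, so the
# largest file types pull the overall figure toward their own rates.
print(f"Weighted average: {total_docs / total_gb:,.0f} docs/GB")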

The average in 2014 was about 25% higher than our average in 2011 (2,500 docs per gigabyte). Does that suggest a decrease in the size of the files we create today? I doubt it. People seem to be using more and more graphical elements in their PowerPoints and Word files, which would suggest larger file sizes and fewer docs per gigabyte. My guess is that we are seeing routine sampling variation here rather than some kind of trend.

EML and Text Files

We had several sites with EML files (about 2 million in total). These were extracted from Lotus Notes databases by one of our processing partners (our process would normally output to HTML rather than EML). An EML file is essentially a text file with some HTML formatting. Including the EML files will increase the averages for files per gigabyte.

We also had sites with a large number of text and HTML files. Some were chat logs, others were purchase orders and still others were product information. If your site has a lot of these kinds of files, you will see higher averages in your overall counts.

Here are the numbers we retrieved for these kinds of files.

[Table: EML, text and HTML files]

Because of the large number of EML files, the weighted average here is much higher, at just over 15,500 files per gigabyte.

Image Files

Many sites had a large number of image files. In some cases they were small GIF files associated with logos or other graphics displayed in the email itself. It appears that these files were extracted from the email during processing and treated as separate records. In our processing, we don’t normally extract these types of files but rather leave them with the original email.

In any event, here are the numbers associated with these types of files.

[Table: Image files]

We did not find many image files in our last study. I don’t know whether these numbers reflect different collection practices or different case issues, or whether these files just happened to show up in the 2014 matters.

In any event, I did not think it would be helpful to our inquiry to include image files (and especially GIF files) because they are not typically useful in a review. If you do include them, the number of docs per gigabyte will be affected.

What Did We Learn?

In many ways, the figures from this study confirmed my conclusions in 2011. Once again, it seems that the industry-accepted figure of 10,000 files per gigabyte is over the mark and even the lower range figure of 5,000 seems high. For the typical files being reviewed by our clients, our number is closer to 3,000.

That value changes depending on what files make up your review population. If your site has a large number of EML or text files, expect the averages to get higher. If, conversely, you have a lot of Excel files, the average can drop sharply.

In my discussion so far, I broke out the different file types in logical groupings. If we include all of the different file types in our weighted averages, the numbers come out like this:

[Table: All file types]

Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower range of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, which is midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn’t 10,000 or anything close. We use a figure closer to 3,000 for our estimates.

 


[1] I wish to particularly thank Greg Berka, Catalyst’s director of application support, for helping to assemble the data used in this article. He also assisted in the 2011 study.

My Prediction for 2014: E-Discovery is Dead — Long Live Discovery!

There has been debate lately about the proper spelling of the shorthand version for electronic discovery. Is it E-Discovery or e-discovery or Ediscovery or eDiscovery? Our friends at DSIcovery recently posted on that topic and it got me thinking.

Big Dog

The big dog today is electronic discovery.

The industry seems to be of differing minds. Several of the leading legal and business publications use e-discovery, as do we. They include Law Technology News, the other ALM publications, the Wall Street Journal (see here, for example), the ABA Journal (example), Information Week (example) and Law360 (example).

Also using e-discovery are industry analysts such as Gartner and 451 Research.

A number of vendors favor the non-hyphenated versions eDiscovery or ediscovery. They include: Symantec, EPIQ, Kroll Ontrack, Recommind and HP Autonomy.

One other vendor, kCura, goes with e-discovery.

Which is It?

So, which is it? E-Discovery or eDiscovery (or some variant on the caps)? I say none of the above. It is time that we take the “E” out of e-discovery once and for all.

When I started practicing law thirty years ago,  there was no “E” in discovery. Rather, it was about exchanging paper documents prior to trial. As documents went digital, the need to consider electronic discovery arose. This new category needed a name. E-discovery seemed a perfect fit.

Today, discovery of electronic files makes up almost the entirety of this thing we call discovery. To be sure, paper documents can still be found but they are the tail that no longer wags the proverbial dog. The big dog today is electronic discovery.

I predict that in 2014 we will start to put the “D” back in Discovery, realizing that we don’t need a special category for what is now a ubiquitous process. Discovery is what this is all about, and it is all digital. Perhaps people will start calling it D-Discovery but I hope not.  Discovery sounds just fine to me.

Cast Your Vote

So, will 2014 be the year we take the “E” out of E-Discovery? That’s my bet. Dealing with electronic files is no longer a segment of the discovery process; it *is* the process. It is time we recognized that fact and dropped the “E” once and for all.

This is discovery after all—no more, no less. There is no longer a distinction between producing paper and electronic files (and the paper ones are all digitized anyway). Why do we need a specialized species when it has already swallowed up the entire genus?

E-Discovery is dead. Long live Discovery.

Tell me if you agree.

Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?

One of the givens of traditional CAR (computer-assisted review)[1] in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.

Later in the process, we use simple random samples to determine whether our CAR succeeded. We sample the discards to ensure that we have not overlooked too many relevant documents.

But is that the best route for our CAR? Maybe not. Our road map leads us to believe that a process called systematic random sampling will get you to your destination faster and with fewer stops. In this post, I will tell you why.[2]

About Sampling

Even we simpleton lawyers (J.D.s rather than Ph.D.s) know something about sampling. Sampling is the process by which we examine a small part of a population in the hope that our findings will be representative of the larger population. It’s what we do for elections. It’s what we do for QC processes. It’s what we do with a box of chocolates (albeit with a different purpose).

Academics call it “probability sampling” because every element has some known probability of being sampled, which in turn allows us to make probabilistic statements about the likelihood that the sample is representative of the larger population.

There are several ways to do this including simple random, systematic and stratified sampling. For this article, my focus is on the first two methods: simple random and systematic.

Simple Random Sampling

The most basic form of sampling is “simple random sampling.” The key here is to employ a sampling process that ensures that each member of the sampled population has an equal chance of being selected.[3] With documents, we do this with a random number generator and a unique ID for each file. The random number generator is used to select IDs in a random order.
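
For the technically inclined, here is what that looks like as a minimal Python sketch, using the standard library’s random number generator over a set of made-up document IDs.

import random

# A simple random sample: every document ID has an equal chance of selection.
doc_ids = [f"DOC-{n:06d}" for n in range(1, 100_001)]   # 100,000 made-up document IDs

random.seed(42)                          # fixed seed so the example is repeatable
sample = random.sample(doc_ids, k=500)   # 500 documents, drawn without replacement

print(sample[:5])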

I am not going to go into confidence intervals, margins of error or other aspects of the predictive side of sampling. Suffice it to say that the size of your random sample helps determine your confidence about how well the sample results will match the larger population. That is a fun topic as well but my focus today is on types of sampling rather than the size of the sampling population needed to draw different conclusions.

Systematic Random Sampling

A “systematic random sample” differs from a simple random sample in two key respects. First, you need to order your population in some fashion. For people, it might be alphabetically or by size. For documents, we order them by their relevance ranking. Any order works as long as the process is consistent and it serves your purposes.

The second step is to draw your sample in a systematic fashion. You do so by choosing every Nth person (or document) in the ranking from top to bottom. Thus, you might select every 10th person in the group to compose your sample. As long as you don’t start with the first person on the list, but instead choose your starting point at random (say, from among the top ten people), your sample is a valid form of random sampling and can be used to estimate the characteristics of the larger population. You can read more about all of this at Wikipedia and many other sources. Don’t just take my word for it.
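
Here is the same idea as a short sketch, assuming the documents have already been ordered by their relevance ranking: pick a random starting point within the first interval, then take every Nth document from there.

import random

# Systematic random sample over a ranked document list:
# order the population, pick a random start, then take every Nth item.

ranked_doc_ids = [f"DOC-{n:06d}" for n in range(1, 100_001)]  # highest-ranked first
sample_size = 500

interval = len(ranked_doc_ids) // sample_size      # every Nth document (N = 200 here)
random.seed(42)
start = random.randrange(interval)                 # random start within the first interval

sample = ranked_doc_ids[start::interval][:sample_size]

print(f"Interval: every {interval}th document, starting at position {start}")
print(sample[:5])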

Why Would I Use a Systematic Random Sample?

This is where the rubber meets the road (to overuse the metaphor). For CAR processes, there are a lot of advantages to using a systematic random sample over a simple random sample. Those advantages include getting a better picture of the document population and increasing your chances of finding relevant documents.

Let me start by emphasizing an important point. When you’re drawing a sample, you want it to be “representative” of the population you’re sampling. For instance, you’d like each sub-population to be fairly and proportionally represented. This particularly matters if sub-populations differ in the quality you want to measure.

Drawing a simple random sample means that we’re not, by our selection method, deliberately or negligently under-representing some subset of the population. However, it can still happen that, due to random variability, we oversample one subset of the population and undersample another. If the sub-populations do differ systematically, then this may skew our results. We may miss important documents.

An Example: Sports Preferences for Airport Travelers

William Webber gave me a sports example to help make the point.

Say we are sampling travelers in a major international airport to see what sports they like (perhaps to help the airport decide what sports to televise in the terminal). Now, sports preference tends to differ among countries, and airline flights go between different countries (and at different times of day you’ll tend to find people from different areas traveling).

So it would not be a good idea to just sit at one gate and sample the first hundred people off the plane. Let’s say you’re in Singapore Airport. If you happen to pick a stop-over flight on the way from Australia to India, your sample will “discover” that almost all air travelers in the terminal are cricket fans. Or perhaps there is a lawyers’ convention in Bali and you’ve picked a flight from the United States; then your study might convince the airport to show American football around the clock.

Let’s say instead that you are able to draw a purely random sample of travelers (perhaps through boarding passes–let’s not worry about the practicality of getting to these randomly sampled individuals). You’ll get a better spread, but you might tend to bunch up on some flights, and miss others–perhaps 50% more samples on the Australian-India flight, and 50% fewer on the U.S.-Bali one.

This might be particularly unfortunate if some individuals were more “important” than the others. To develop the scenario, let’s say the airport also wanted to offer sports betting for profit. Then maybe American football is an important niche market, and it would be unfortunate if your random sample happened to miss those well-heeled lawyers dying to bet on that football game I am watching as I write this post.

What you’d prefer to do (and again, let’s ignore practicalities) is to spread your sample out, so that you are assured of getting an even coverage of gates and times (and even seasons of the year). Of course, your selection will still have to be random within areas, and you still might get unlucky (perhaps the lawyer you catch hates football and is crazy about croquet). But you’re more likely to get a representative sample if your approach is systematic rather than simple random.

Driving our CAR Systematically

Let’s get back in our CAR and talk about the benefit of sampling against our document ranking. In this case, the value we’re trying to estimate is “relevance” (or more exactly, something about the distribution of relevance). Here, the population differentiation is a continuous one, from the highest relevance ranking to the lowest. This differentiation is going to be strongly correlated with the value we’re trying to measure.

Highly ranked documents are more likely to be relevant than lowly ranked ones (or so we hope). So if our simple random sample happened by chance to over-sample from the top of the ranking, we’re going to overstate the total number of relevant documents in the population.

Likewise, if our random sample happened by chance to oversample from the bottom of the ranking, we would understate the number of relevant documents in the population. By moving sequentially through the ranking from top to bottom, a systematic random sample removes the danger of this random “bunching,” and so makes our estimate more accurate overall.

At different points in the process, we might also want information about particular parts of the ranking. First, we may be trying to pick a cutoff. That suggests we need good information about the area around our candidate cutoff point.

Second, we might wonder if relevant documents have managed to bunch in some lower part of the ranking. It would be unfortunate if our simple random sample happened not to pick any documents from this region of interest. It would mean that we might miss relevant documents.

With a systematic random sample, we are guaranteed that each area of the ranking is equally represented. That is the point of the sample, to draw from each segment in the ranking (decile for example) and see what kinds of documents live there. Indeed, if we are already determined to review the top-ranking documents, we might want to place more emphasis on the lower rankings. Or not, depending on our goals and strategy.

Either way, the point of a systematic random sample is to ensure that we sample documents across the ranking–from top to bottom. We do so in the belief that it will provide a more representative look at our document population and give us a better basis to draw a “yield curve.”[4] To be fair, however, the documents selected from any particular region might not be representative of that region. Whether you choose simple random or systematic sampling, there is always some chance that you will miss important documents.

Does it Work?

Experiments have shown us that documents can bunch together in a larger population. Back in the paper days, I knew that certain custodians were likely to have the “good stuff” in their correspondence files and I always went there first in my investigation. Likewise, people generally kept types of documents together in boxes, which made review quicker. I could quickly dismiss boxes of receipts when they didn’t matter to my case while spending my time on research notebooks when they did.

Similarly, and depending on how they were collected, relevant documents are likely to be found in bunches across a digital population. After all, I keep files on my computer in folders much like I did before I had a computer. It helps with retrieval. Other people do as well. The same is true for email, which I dutifully folder to keep my inbox clear.

So, no problem if those important documents get picked up during a random sample, or even because they are similar to other documents tagged as relevant. However, sometimes they aren’t picked up. They might still be bunched together but simply fall toward the bottom of the ranking. Then you miss out on valuable documents that might be important to your case.

While no method is perfect, we believe that a systematic random sample offers a better chance that these bunches get picked up during the sampling process. The simple reason is that we are intentionally working down the ranking to make sure we see documents from all segments of the population.

From experiments, we have seen this bunching across the ranking (yield) curve. By adding training documents from these bunches, we can quickly improve the ranking, which means we find more relevant documents with less effort. Doing so means we can review fewer documents at a lower cost. The team gets through the review more quickly as well, which is important when deadlines are tight.

Many traditional systems don’t support systematic random sampling. If that is the case with your CAR, you might want to think about an upgrade. There is no doubt that simple random sampling will get you home eventually but you might want to ride in style. Take a systematic approach for better results and leave the driving to us.

 


[1] I could use TAR (Technology Assisted Review) but it wouldn’t work as well for my title. So, today it is CAR. Either term works for me.

[2] Thanks are due to William Webber, Ph.D., who helped me by explaining many of the points raised in this article. Webber is one of a small handful of CAR experts in the marketplace and, fortunately for us, a member of our Insight Predict Advisory Board. I am using several of his examples with permission.

[3] Information retrieval scientists put it this way: In simple random sampling, every combination of elements from the population has the same probability of being the sample. The distinction here is probably above the level of this article (and my understanding).

[4] Yield curves are used to represent the effectiveness of a document ranking and are discussed in several other blog posts I have written (see, e.g., here, here, here and here). They can be generated from a simple random sample but we believe a systematic random sample–where you move through all the rankings–will provide a better and more representative look at your population.

In the World of Big Data, Human Judgment Comes Second, The Algorithm Rules

I read a fascinating blog post from Andrew McAfee for the Harvard Business Review. Titled “Big Data’s Biggest Challenge? Convincing People NOT to Trust Their Judgment,” the article’s primary thesis is that as the amount of data goes up, the importance of human judgment should go down.

Downplay human judgment? In this age, one would think that judgment is more important than ever. How can we manage in this increasingly complex world if we don’t use our judgment?

Even though it may seem counterintuitive, support for this proposition is piling up rapidly. McAfee cites numerous examples to back his argument. For one, it has been shown that parole boards do much worse than algorithms in assessing which prisoners should be sent home. Pathologists are not as good as image analysis software at diagnosing breast cancer. And, apparently a number of top American legal scholars got beat at predicting Supreme Court votes by a data-driven decision rule.

Have you heard how they finally taught computers to translate? For many years, humans tried to create more and more complicated rules to govern the translation of grammar in different languages. Microsoft and many others struggled with the problem, finding they could only get so far with this largely human-driven approach, in which people articulated the rules of the road. The resulting translations were sometimes passable but more often comical.

Franz Josef Och, a research scientist at Google, tried a different approach. Rather than try to define language through rules and grammar, he simply tossed a couple billion translations at the computer to see what would happen. The result was a huge leap forward in the accuracy of computerized translation and a model that most other companies (including Microsoft) follow today. You can read more here and more about these kinds of stories in the book, Big Data: A Revolution That Will Transform How We Live, Work and Think.

What’s this have to do with legal search?

It turns out a lot. McAfee reaches the surprising conclusion that humans need to play second fiddle to the algorithms when it comes to big data. Despite our intuition, the purpose of data analytics is not to assist humans in exercising their intuition. As much as that makes sense to us carbon-based units, we simply don’t do as good a job in many situations even when presented with the insights that algorithms can provide. We tend to dismiss them in favor of our emotions and biases. At best:

What you usually see is [that] the judgment of the aided experts is somewhere in between the model and the unaided expert. So the experts get better if you give them the model. But still the model by itself performs better.

(Citing sociologist Chris Snijders, quoted in the Ian Ayres book, Super Crunchers: Why Thinking-by-Numbers is the New Way to be Smart.)

We need to flip our bias on its head, McAfee argues. Rather than have the algorithm aid the expert, the better approach is to have the expert assist the algorithm. It turns out that the results get better when the expert lends his or her judgment to the computer algorithm rather than vice versa. As Ayres put it in Super Crunchers:

Instead of having the statistics as a servant to expert choice, the expert becomes a servant of the statistical machine.

Here is the fun part. It turns out that lawyers are at the forefront of this trend. “How so?” you ask. Easy, I respond. Technology-assisted review.

Although TAR vendors use different algorithms and even different approaches, the lawyer serves the algorithm and not vice versa. For traditional TAR, we look to subject matter experts to train the algorithm. To train the algorithm, they review documents and tag them as relevant or not. The algorithm uses their judgments to assist in building its rankings. But the order is clear: The experts are serving the algorithm and not the other way around.
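
To make that workflow concrete, here is a deliberately generic sketch of the pattern: reviewer judgments train a simple text classifier, and the classifier, not the reviewer, then ranks the unreviewed documents. This is a toy illustration built from off-the-shelf components (scikit-learn), not Catalyst’s algorithm or any vendor’s actual product.

# A toy illustration of the training loop described above: human judgments
# go in, a statistical model builds the ranking. Not any vendor's actual system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

training_docs = [
    "memo discussing the widget pricing agreement",    # tagged relevant
    "agenda for the quarterly widget pricing call",     # tagged relevant
    "invitation to the company holiday party",          # tagged not relevant
    "IT notice about scheduled server maintenance",     # tagged not relevant
]
training_tags = [1, 1, 0, 0]   # the reviewers' judgments: 1 = relevant, 0 = not

unreviewed_docs = [
    "draft widget pricing terms for counsel review",
    "reminder to submit expense reports",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(training_docs)

model = LogisticRegression()
model.fit(X_train, training_tags)                 # the judgments train the model

scores = model.predict_proba(vectorizer.transform(unreviewed_docs))[:, 1]
for score, doc in sorted(zip(scores, unreviewed_docs), reverse=True):
    print(f"{score:.2f}  {doc}")                  # the model, not the reviewer, ranks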

Even if we use review teams instead of experts for training, as I have considered in several recent articles (see here and here), the pecking order doesn’t change. The reviewers are working for the algorithm to support its efforts to analyze big data. The algorithm and not the reviewers is ultimately the decision maker for the ranking; we play only a supporting role.

Several studies have documented the superiority of TAR over human judgment. In a study published in 2011, Maura Grossman of Wachtell, Lipton, Rosen & Katz and Prof. Gordon Cormack of the University of Waterloo, concluded, “[T]he myth that exhaustive manual review is the most effective—and therefore the most defensible—approach to document review is strongly refuted. TAR can (and does) yield more accurate results than exhaustive manual review, with much lower effort.”

Earlier, in their 2009 study for the Electronic Discovery Institute, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Herbert L. Roitblat, Anne Kershaw and Patrick Oot compared human review to TAR. “On every measure,” they concluded, “the performance of the two computer systems was at least as accurate … as that of a human re-review.”

But while these studies only hinted at McAfee’s thesis, the further evolution of TAR technology and the further growth of big data have made it explicitly clear. It turns out that even the legal profession, with its reverence for tradition, is no longer immune from these evolutionary trends. Big data demands new methods and new masters, and legal is no exception. All we can do is listen and learn and move with the times.

The Future

Long ago a wit proclaimed that the law office of the future would have a lawyer, a dog and a computer. The lawyer would be there to turn on the computer in the morning. The dog was there to keep the lawyer away from the computer for the rest of the day.

I wonder if that fellow was thinking about Big Data and the evolution of our information society? If not, he came pretty close in his prediction. The dog just gave way to a smart algorithm. We call ours Fido.

Are Subject Matter Experts Really Required for TAR Training?
(A Follow-Up on TAR 2.0 Experts vs. Review Teams)

I recently wrote an article challenging the belief that subject matter experts (SMEs) are required for training the system in technology-assisted review. (See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?) This view, held by almost everyone involved with TAR, stems from the common-sense notion that consistency and, ultimately, correctness are critical to a successful ranking process. Indeed, Ralph Losey made that case eloquently in his blog post: “Less Is More: When it comes to predictive coding training, the ‘fewer reviewers the better’ – Part One.” He argues that one SME is the “gold standard” for the job.


Putting our science hats on, we tested this hypothesis using data from the 2010 TREC program.[1] As I reported in the earlier article, we ran tests comparing ranking effectiveness when training on seeds from the topic authorities (SMEs) versus seeds from the reviewers. We found that the two sets of rankings were close–there was no clear winner, although one did better than the other at different points or on different projects.

Interestingly, we found that using the experts to QC 10% of the review team’s judgments proved as effective as using experts alone and sometimes better.  We also showed that using experts for QC review was both quicker and cheaper than using experts alone for the training.

Pushing the Limits

For this post, we decided to push the limits of our earlier experiment to see what might happen if we pitted the experts against the review teams in yet another way. For this experiment, we sought out the specific training documents where the SMEs (topic authorities) expressly disagreed with the judgments by the review team. Specifically, we decided to identify documents where the experts voted one way (responsive, for example) and the reviewers the other way (non-responsive). Our plan was to use these documents, and these documents only, to train our system. We wanted to see which set of judgments would produce the better results: those from the experts or the conflicting judgments from the review team.

To be completely clear: In these experiments, the judgments from the review team were not just slightly wrong. The review team judgments were, from the perspective of the topic authority, 100% wrong. Every single document in these training sets that the topic authority marked as responsive, the review team marked as nonresponsive, and vice versa.

Methodology

As in my earlier article, we again worked with four topics from the TREC legal track. For each topic, we sought out documents where the judgments of the experts and reviewers conflicted. We used those conflicting judgments (and only those judgments) to twice train the algorithm: once using the experts’ judgments and once using the review team’s judgments. We wanted to see which approach produced the better ranking.

Following the methodology of my earlier article, we also did a third run of the algorithm. For this one, we simulated using the experts for quality control. We assumed that the experts would review 10% of the review team’s judgments and correct them (because in this case we already knew that the experts disagreed with the review team). Thus, the third run contained a mix of the review team’s judgments (arguably mistakes) with a randomly selected 10% of them corrected to reflect the experts’ (arguably correct) judgments.

Because we are training using only documents on which the topic authorities and non-authorities disagree, the absolute level of performance on these topics is lower than if we had used every available training document, i.e. all those additional documents on which both the authorities and non-authorities agreed. That, however, is not the purpose of this experiment. Our goal was to isolate and assess relative differences between these two variables.

Sound like fun? Here is how it turned out.

Quick Primer on Yield Curves

As before, we present the results using a yield curve. A yield curve presents the results of a ranking process and is a handy way to illustrate the difference between two processes. The X axis shows the percentage of documents reviewed. The Y axis shows the percentage of relevant documents (recall) found at each point in the review.

The higher the curve is and the closer it is to the top left corner, the better the ranking. The sharply rising curve signifies that the reviewer is presented with a higher percentage of relevant documents, which is the goal of the process.

The gray diagonal line shows the results of a random presentation of documents, which is the expected outcome of linear review. On average, the reviewer can expect to see 10% of the relevant documents after reviewing 10% of the total, 50% after 50%, and so on until the review is complete. It provides a baseline for our analysis because review efficiency shouldn’t get any worse than this.
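
Purely to illustrate the mechanics (and not how Insight Predict builds its curves), here is a short Python sketch that computes recall at each review depth from a simulated ranking. The labels and scores are invented so the example runs on its own; a real curve would use actual reviewer judgments and the engine’s actual ranking.

import random

# Build a yield curve: recall achieved at each review depth of a ranked list.
# Labels and scores below are simulated purely to make the example run.

random.seed(7)
docs = []
for _ in range(10_000):
    is_relevant = random.random() < 0.10          # roughly 10% richness
    # A noisy score that is higher, on average, for relevant documents,
    # standing in for a ranking engine's output.
    score = random.gauss(1.0 if is_relevant else 0.0, 0.75)
    docs.append((score, is_relevant))

docs.sort(key=lambda d: d[0], reverse=True)       # review order: highest score first

total_relevant = sum(is_rel for _, is_rel in docs)
seen_relevant = 0
for depth, (_, is_relevant) in enumerate(docs, start=1):
    seen_relevant += is_relevant
    if depth % 1_000 == 0:                        # report at every 10% of the review
        print(f"Reviewed {depth / len(docs):4.0%} -> recall {seen_relevant / total_relevant:4.0%}")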

Issue 1

[Chart: TREC Issue 1 yield curves]

In this experiment, the ranking curve based on the expert judgments performed better than the one based on the review team’s judgments. The expert training hit 80% recall after reviewing about 10% of the total documents. The review team reached 80% recall at about 22% of the total population. When we based the training on the set of documents that included 10% QC correction from the SMEs, the review team would have reached 80% after reviewing about 15% of the total population.

One point is worth noting here. Even though the expert performed better than the review team, a ranking based on reviewer judgments still performed substantially better than manual linear review. Even relying solely on the review team’s training, you still only have to go about 22% of the way through the collection to get to 80% recall. With manual review, you’d have to go 80% of the way to get 80% recall.

Issue 2

[Chart: TREC Issue 2 yield curves]

In this case, the results are similar to Issue 1, albeit with different percentages of the total document population that needs to be reviewed. Using expert judgments only, you would reach 80% recall after reviewing about 48% of the total document population. Using reviewer judgments only, you would hit 80% at just under 57% of the review. Using an expert to perform QC of review team judgments, the reviewers would have hit 80% at about 53% of the review.

Issue 3

[Chart: TREC Issue 3 yield curves]

This case was particularly interesting. Using the same 80% recall threshold, the expert judgments and the (diametrically opposite) review team judgments brought the same results. You would only have to review about 30% of the document population regardless of whether you built the ranking on expert judgments or review team judgments.

Even more interesting, the experts’ ranking curve became worse after reaching 80% recall. The review team judgments produced a superior ranking as did the QC process.

Issue 4

[Chart: TREC Issue 4 yield curves]

Here again, the ranking based on expert judgments did substantially worse than the ranking based on review team judgments. Specifically, the expert did significantly worse in the early stages than the review team or the review team supplemented by 10% QC review by an expert.

Interestingly, the lines converge at about the 80% point and stay together after that. So for recall thresholds above 80%, you would have gotten the same results regardless of whether you relied on an expert’s judgment or that of your review team.

It is important to note that in all of these experiments, the expert judgment as to responsiveness or nonresponsiveness of a document was used as ground truth, i.e. to draw all yield curves. Thus, even for the rankings based on review team judgments, we evaluated the quality of those rankings based on the expert judgments.

What does this all mean?

From a scientific perspective, perhaps nothing at all. We took a relatively limited number of training seeds and based our rankings on them. In two examples, rankings based on the experts’ judgments outperformed the rankings based on review team judgments. In the other two examples, the review team judgments seemed to produce rankings as good as or sometimes better than the experts’.

It bears repeating that we used only documents on which the topic authorities and non-authorities disagreed. For that reason, the absolute level of performance on these topics is lower than if we had used every available training document, i.e. all those additional documents on which both the authorities and non-authorities agreed. That, however, was not the purpose of these experiments.

The point of these experiments was to see what happens when things go very badly during the training process. We intentionally removed from training all documents that the experts and review teams agreed on. Instead, we simply focused on what might happen with opposite judgments on documents that likely required a judgment call on relevance, using our algorithms of course.

What we see from the experiments is that even if the review team was 100% wrong, it didn’t completely destroy the value of the ranking results. Even at their worst, the review team rankings were still significantly better than linear review and often matched those based on expert judgments.

The point here is to suggest that you have options for training. Some will want to use the subject matter expert extensively for training. That approach works fine with us, our system and our algorithm.

But others may prefer a different approach. Our research suggests strongly that the expert can focus on finding useful exemplars for training, sample as much or as little as they like, and do QC during the course of the review. Meanwhile, the review team gets going right away and you can take advantage of continuous ranking. By its nature, continuous ranking will require that you use the judgments of the review team for training purposes.

It is also noteworthy in these experiments that, even when the review team’s judgments were not as effective as the expert’s, they nevertheless yielded good results. More to the point, even when they were 100 percent wrong, as they were in these experiments, they produced good results – really good results, relative to manual linear review.

This counters the suggestion of the SME-only advocates that the training documents have to be right or else the overall process will be completely wrong. That argument misses the point. The point isn’t that the process completely succeeds or completely fails; it is a matter of degree. Even when the training is 100% wrong, the process does not completely fail. To the contrary, it performs well.

One final note to the research scientists and TAR geeks who are reading this. By now, you’ve no doubt realized that we’ve spilled a bit of our secret sauce in this post. By telling you that we used judgments that were 100% wrong, and still got results better than the random baseline, we have revealed hints about our proprietary TAR technologies here at Catalyst and how we conceptually approach our algorithms. Not every algorithm is going to be able to use 100% wrong judgments and still achieve better-than-random results. Even so, we believe that the value of openly sharing our results and encouraging this discussion outweighs the loss of any of our sauce.

We will continue to look for opportunities to compare the judgments of SMEs and review teams but we hope that our initial experiments will contribute to this interesting discussion.

 


[1] The Text Retrieval Conference (TREC) is sponsored by the National Institute of Standards and Technology. (http://trec.nist.gov/)

Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?

One of the givens of traditional technology-assisted review (“TAR”) is the notion that a subject matter expert (“SME”) is required to train the algorithm. During a recent EDRM webinar, for example, I listened to an interesting discussion about whether you could use more than one expert to train the algorithm, presumably to speed up the process. One panelist stated confidently that using four or five SMEs for training would be unworkable. (I guess they would be hard to manage.) But she wondered whether two or three experts might be OK.


In quick response, another speaker countered that consistency was the key to effective training against a reference set (a staple of traditional TAR), and that having one expert review the training documents was critical to the process.

I found myself wanting to jump into the conversation. (I couldn’t, of course, because webinars don’t work that way.) Putting aside whether experts are consistent even in their own tagging, are we really sure that TAR training is an “experts-only” process, as is suggested by many proponents of what I call TAR 1.0?

What about having review teams assist in the training process? What if we use experts to do what they do best–find good documents to help get the ranking started? Then send the highest-ranked documents to the review team so they can start right away? Let the expert continue to find good documents through witness interviews and search techniques or even some form of random sampling. That approach would allow the review team to get going right away, rather than wait for the expert to finish the document-training process.

My thoughts were not just pipe dreams–I had seen the results of our research examining just this question. Dr. Jeremy Pickens, our Senior Research Scientist, had done experiments using the TREC data from 2010 to see whether expert training provided better rankings than could a review team. The question is important because a lot of review managers find it difficult to keep their teams waiting while a senior person finds time to look at 3,000 or more documents necessary for TAR training.[1] On top of that, the same expert is required to come back and train any time new uploads are introduced to the collection.

Dr. Pickens’ research was done using Insight Predict, our proprietary engine for predictive ranking. He did it in conjunction with our research on the benefits of continuous ranking, which I wrote about in a separate post. The goal for that work was to see if a continuous learning process might provide better ranking results and, ultimately, further reduce the number of documents necessary for review. Our conclusion was yes, that continuous ranking could save on review costs and cut the time needed for review.

Welcome to TAR 2.0, where we challenge many of the established notions of the traditional TAR 1.0 process.

Where Do Experts Fit in TAR 2.0?

If you accept the cost-saving benefits of continuous ranking, you are all but forced to ask about the role of experts. Most experts I know (often senior lawyers) don’t want to review training documents, even though they may acknowledge the value of this work in cutting review costs. They chafe at clicking through random and often irrelevant documents and put off the work whenever possible.

Often, this holds up the review process and frustrates review managers, who are under pressure to get moving as quickly as possible. New uploads are held hostage until the reluctant expert can come back to the table to review the additional seeds. Indeed, some see the need for experts as one of the bigger negatives about the TAR process.

Continuous ranking using experts would be a non-starter. Asking senior lawyers to review 3,000 or more training documents is one thing. Asking them to continue the process through 10,000, 50,000 or even more documents could lead to early retirement–yours, not theirs. I can hear it now: “I didn’t go to law school for that kind of work. Push it down to the associates or those contract reviewers we hired. That’s their job.”

So, our goal was to find out how important experts are to the training process, particularly in a TAR 2.0 world. Are their judgments essential to ensure optimal ranking or can review team judgments be just as effective? Ultimately, we wondered if experts could work hand in hand with the review team, doing tasks better suited to their expertise, and achieve better and faster training results–at less cost than using the expert exclusively for the training.

Our results were interesting, to say the least.

Research Population

We used data from the 2010 TREC program[2] for our analysis. The TREC data is built on a large volume of the ubiquitous Enron documents, which we used for our ranking analysis. We used judgments about those documents (i.e. relevant to the inquiry or not) provided by a team of contract reviewers hired by TREC for that purpose.

In many cases, we also had judgments on those same documents made by the topic authorities on each of the topics for our study. This was because the TREC participants were allowed to challenge the judgments of the contract reviewers. Once challenged, the document tag would be submitted to the appropriate topic authority for further review. These were the people who had come up with the topics in the first place and presumably knew how the documents should be tagged. We treated them as SMEs for our research.

So, we had data from the review teams and, often, from the topic authorities themselves. In some cases, the topic authority affirmed the reviewer’s decision. In other cases, they were reversed. This gave us a chance to compare the quality of the document ranking based on the review team decisions and those of the SMEs.[3]

Methodology

We worked with the four TREC topics from the legal track. These were selected essentially at random. There was nothing about the documents or the results that caused us to select one topic over the other. In each case, we used the same methodology I will describe here.

For each topic, we started by randomly selecting a subset of the overall documents that had been judged. Those became the training documents, sometimes called seeds. The remaining documents were used as evaluation (testing) documents. After we developed a ranking based on the training documents, we could test the efficacy of that ranking against the actual review tags in the larger evaluation set.[4]

As mentioned earlier, we had parallel training sets, one from the reviewers and one from the SMEs. Our random selection of documents for training included documents on which both the SME and a basic reviewer agreed, along with documents on which the parties disagreed. Again, the selection was random so we did not control how much agreement or disagreement there was in the training set.

Experts vs. Review Teams: Which Produced the Better Ranking?

We used Insight Predict to create two separate rankings. One was based on training using judgments from the experts. The other was based on training using judgments from the review team. Our idea was to see which training set resulted in a better ranking of the documents.

We tested both rankings against the actual document judgments, plotting our results in standard yield curves. In that regard, we used the judgments of the topic authorities to the extent they differed from those of the review team. Since they were the authorities on the topics, we used their judgments in evaluating the different rankings. We did not try to inject our own judgments to resolve the disagreement.

Using the Experts to QC Reviewer Judgments

As a further experiment, we created a third set of training documents to use in our ranking process. Specifically, we wanted to see what impact an expert might have on a review team’s rankings if the expert were to review and “correct” a percentage of the review team’s judgments. We were curious whether it might improve the overall rankings and how that effort might compare to rankings done by an expert or review team without the benefit of a QC process.

We started by submitting the review team’s judgments to Predict and asking it to rank the documents. From that ranking, we then identified:

  1. The lowest-ranked positive judgments (reviewer tagged it relevant while  Predict ranked it highly non-relevant); and
  2. The highest-ranked negative judgments (reviewer tagged it non-relevant while Predict ranked it highly relevant).

The goal here was to select the biggest outliers for consideration. These were documents where our Predict ranking system most strongly differed from the reviewer’s judgment, no matter how the underlying documents were tagged.

We simulated having an expert look at the top 10 percent of these training documents. In cases where the expert agreed with the reviewer’s judgments, we left the tagging as is. In cases where the expert had overturned the reviewer’s judgment based on a challenge, we reversed the tag. When this process was finished, we ran the ranking again based on the changed values and plotted those values as a separate line in our yield curve.
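
For readers who want the mechanics spelled out, here is a minimal sketch of that outlier-selection step, with invented tags and scores standing in for the real judgments and the ranking engine’s output.

# Select the biggest reviewer/model disagreements for expert QC.
# Each record: (doc_id, reviewer_tag, model_score) -- all values invented for illustration.
# reviewer_tag: 1 = tagged relevant, 0 = tagged non-relevant
# model_score: the ranking engine's estimate of relevance (assumed given, not computed here)

judged_docs = [
    ("DOC-001", 1, 0.05),   # tagged relevant, ranked highly non-relevant -> outlier
    ("DOC-002", 0, 0.97),   # tagged non-relevant, ranked highly relevant -> outlier
    ("DOC-003", 1, 0.88),   # tag and ranking agree
    ("DOC-004", 0, 0.12),   # tag and ranking agree
    # ... in practice, thousands of training judgments
]

def disagreement(record):
    _, tag, score = record
    # A low score on a positive tag, or a high score on a negative tag, is a big outlier.
    return (1 - score) if tag == 1 else score

qc_count = max(1, len(judged_docs) // 10)           # the top 10% go to the expert
qc_queue = sorted(judged_docs, key=disagreement, reverse=True)[:qc_count]

for doc_id, tag, score in qc_queue:
    print(f"Send {doc_id} to expert QC (reviewer tag={tag}, model score={score:.2f})")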

Plotting the Differences: Expert vs. Reviewer Yield Curves

A yield curve presents the results of a ranking process and is a handy way to visualize the difference between two processes. The X axis shows the percentage of documents that are reviewed. The Y axis shows the percentage of relevant documents found at each point in the review.

Here were the results of our four experiments.

Issue One

[Chart: TREC Issue 1 yield curves]

The lines above show how quickly you would find relevant documents during your review. As a baseline, I created a gray diagonal line to show the progress of a linear review (which essentially moves through the documents in random order). Without a better basis for ordering the documents, the recall rate for a linear review typically matches the percentage of documents actually reviewed–hence the straight line. By the time you have seen 80% of the documents, you probably have seen 80% of the relevant documents.

The blue, green and red lines are meant to show the success of the rankings for the review team, the expert, and the review team aided by an expert QC of a portion of its judgments. Notice that all of the lines are above and to the left of the linear review curve. This means that any of these ranking methods would dramatically improve the speed at which you find relevant documents compared with a linear review process. Put another way, a ranked review approach would present more relevant documents at any point in the review (until the end). That is not surprising because TAR is typically more effective at surfacing relevant documents than linear review.

In this first example, the review team seemed to perform less effectively than the expert reviewer at lower recall rates (the blue curve is below and to the right of the other curves). The review team ranking would, for example, require the review of a slightly higher percentage of documents to achieve an 80% recall rate than the expert ranking.[5] Beyond 80%, however, the lines converge and the review team seems to do as good a job as the expert.

When the review team was assisted by the expert, through a QC process, the results were much improved. The rankings generated by the expert-only review were almost identical to the rankings produced by the review team with QC assistance from the expert. I will show later that this approach would save you both time and money, because the review team can move more quickly than a single reviewer and typically bills at a much lower rate.

Issue Two

[Chart: TREC Issue 2 yield curves]

In this example, the yield curves are almost identical, with the rankings by the review team being slightly better than those of an expert alone. Oddly, the expert QC rankings drop a bit around the 80% recall line and stay below until about 85%. Nonetheless, this experiment shows that all three methods are viable and will return about the same results.

Issue Three

[Chart: TREC Issue 3 yield curves]

In this case the ranking lines are identical until about the 80% recall level. At that point, the expert QC ranking process drops a bit and does not catch up to the expert and review team rankings until about 90% recall. Significantly, at 80% recall, all the curves are about the same. Notice that this recall threshold would only require a review of 30% of the documents, which would suggest a 70% cut in review costs and time.

Issue Four

[Chart: TREC Issue 4 yield curves]

Issue four offers a somewhat surprising result and may be an outlier. In this case, the expert ranking seems substantially inferior to the review team or expert QC rankings. The divergence starts at about the 55% recall rate and continues until about 95% recall. This chart suggests that the review team alone would have done better than the expert alone. However, the expert QC method would have matched the review team’s rankings as well.

What Does This All Mean?

That’s the million-dollar question. Let’s start with what it doesn’t mean. These were tests using data we had from the TREC program. We don’t have sufficient data to prove anything definitively but the results sure are interesting. It would be nice to have additional data involving expert and review team judgments to extend the analysis.

In addition, these yield curves came from our product, Insight Predict. We use a proprietary algorithm that could work differently from other TAR products. It may be that experts are the only ones suitable to train some of the other processes. Or not.

That said, these yield curves suggest strongly that the traditional notion that only an expert can train a TAR system may not be correct. On average in these experiments, the review teams did as well or better than the experts at judging training documents. We believe it provides a basis for further experimentation and discussion.

Why Does this Matter?

There are several reasons this analysis matters. They revolve around time and money.

First, in many cases, the expert isn’t available to do the initial training, at least not on your schedule. If the review team has to wait for the expert to get through 3,000 or so training documents, the delay in the review can present a problem. Litigation deadlines seem to get tighter and tighter. Getting the review going more quickly can be critical in some instances.

Second, having review teams participate in training can cut review costs. Typically, the SME charges at a much higher billing rate than a reviewer. If the expert has to review 3,000 training documents at a higher billable rate, total costs for the review increase accordingly. Here is a simple chart illustrating the point.

[Chart: training cost and time comparison]

Using the assumptions I have presented, having an expert do all of the training would take 50 hours and cost almost $27,500. In contrast, having a review team do most of the training while the expert does a 10% QC reduces the cost by roughly 80%, to $5,750. The net time spent on the combined process drops from 50 hours (6+ days) to about 10 hours, a bit more than a day.[6]

You can use different assumptions for this chart but the point is the same. Having the review team involved in the process saves time and money. Our testing suggests that this happens with no material loss to the ranking process.
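
To make that arithmetic easy to adjust, here is a minimal sketch of the cost comparison. The hourly rates, review speed and training-set size are placeholders I chose because they reproduce the totals above; they are not the chart's actual inputs, so substitute your own figures.

```python
# Illustrative training-cost model; every figure here is an assumption to adjust.
TRAINING_DOCS = 2500      # assumed size of the training set
DOCS_PER_HOUR = 50        # assumed review speed
EXPERT_RATE = 550         # assumed expert billing rate, $/hour
REVIEWER_RATE = 75        # assumed contract reviewer rate, $/hour
QC_FRACTION = 0.10        # expert QCs 10% of the team's training judgments

expert_hours = TRAINING_DOCS / DOCS_PER_HOUR
expert_only_cost = expert_hours * EXPERT_RATE

team_hours = TRAINING_DOCS / DOCS_PER_HOUR
qc_hours = TRAINING_DOCS * QC_FRACTION / DOCS_PER_HOUR
team_with_qc_cost = team_hours * REVIEWER_RATE + qc_hours * EXPERT_RATE

print(f"Expert only: {expert_hours:.0f} hours, ${expert_only_cost:,.0f}")
print(f"Team + 10% expert QC: {team_hours:.0f} team hours, "
      f"{qc_hours:.0f} expert hours, ${team_with_qc_cost:,.0f}")
```

Change the rates, speed or QC percentage to match your own matter; the relative gap, not the exact dollars, is the point.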

This all becomes mandatory when you move to continuous ranking, because that process is built on using the review team, rather than an expert, to do the training as the review proceeds. Any other approach would not make sense economically, nor would it be a desirable use of the expert’s time.

So what should the expert do in a TAR 2.0 environment? We suggest that experts do what they are trained to do (and have been doing since our profession began). Use the initial time to interview witnesses and find important documents. Feed those documents to the ranking system to get the review started. Then use their time to QC the review team’s judgments and to search for additional good documents. Our research so far suggests that this approach makes good sense from both a logical and an efficiency standpoint.

 


[1] Typical processes call for an expert to train about 2,000 documents before the algorithm “stabilizes.” They also require the expert to review 500 or more documents to create a control set for testing the algorithm and a similar amount for testing the ranking results once training is complete. Insight Predict does not use a control set (the system ranks all the documents with each ranking), although creating a yield curve does require a systematic sample.

[2] The Text Retrieval Conference is sponsored by the National Institute for Standards and Technology. (http://trec.nist.gov/)

[3] We aren’t claiming that this perfectly modeled a review situation but it provided a reasonable basis for our experiments. In point of fact, the SME did not re-review all of the judgments made by the review team. Rather, the SME considered those judgments where a vendor appealed a review team assessment. In addition, the SMEs may have made errors in their adjudication or otherwise acted inconsistently. Of course that can happen in a real review as well. We just worked with what we had.

[4] Note that we do not consider this the ideal workflow. Because it relies on a completely random seed set, with no iteration and no judgmental or automated seeding, this test does not (and is not intended to) create the best possible yield curve. Our goal here was to put all three tests on level footing, which this methodology does.

[5] In this case, you would have to review 19% of the documents to achieve 80% recall for the ranking based only on the review team’s training and only 14% based on training by an expert.

[6] I used “net time spent” for the second part of this chart to illustrate the real impact of the time saved. While the review takes a total of 55 hours (50 for the team and 5 for the expert), the team works concurrently. Thus, the team finishes in just 5 hours, leaving the expert another 5 hours to finish his QC. The training gets done in a day (or so) rather than a week.

TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?

For all of its complexity, technology-assisted review (TAR) in its traditional form is easy to sum up:

  1. A lawyer (subject matter expert) sits down at a computer and looks at a subset of documents.
  2. For each, the lawyer records a thumbs-up or thumbs-down decision (tagging the document). The TAR algorithm watches carefully, learning during this training.
  3. When training is complete, we let the system rank the full set of documents and divide it into (predicted) relevant and irrelevant groups.[1]
  4. We then review the documents predicted relevant, ignoring the rest.

The benefits from this process are easy to see. Let’s say you started with a million documents that otherwise would have to be reviewed by your team. If the computer algorithm predicted with the requisite degree of confidence that 700,000 are likely not-relevant, you could then exclude them from the review for a huge savings in review costs. That is a great result, particularly if you are the one paying the bills.


But is that it? Once you “part the waters” after the document ranking, you are stuck reviewing the 300,000 that fall on the relevant side of the cutoff. If I were the client, I would wonder whether there were steps you could take to reduce the document population even further. While reviewing 300,000 documents is better than a million, cutting that to 250,000 or fewer would be even better.

Can we reduce the review count even further?

The answer is yes, if we can change the established paradigm. TAR 1.0 was about the benefits of identifying a cutoff point after running a training process using a subject matter expert (SME). TAR 2.0 is about continuous ranking throughout the review process—using review teams as well as SMEs. As the review teams work their way through the documents, their judgments are fed back to the computer algorithm to further improve the ranking. As the ranking improves, the cutoff point is likely to improve as well. That means even fewer documents to review, at a lower cost. The work gets done more quickly as well.

It can be as simple as that!

Insight Predict is built around this idea of continuous ranking. While you can use it to run a traditional TAR process, we encourage clients to take more than one bite at the ranking apple. Start the training by finding as many relevant documents (responsive, privileged, etc.) as your team can identify. Supplement these documents (often called seeds) through random sampling, or use our contextual diversity sampling to view documents selected for their distinctiveness from documents already seen.[2]

The computer algorithm can then use these training seeds as a basis to rank your documents. Direct the top-ranked ones to the review team for their consideration.

In this scenario, the review team starts quickly, working from the top of the ranked list. As they review documents, you feed their judgments back to the system to improve the ranking, supplemented with other training documents chosen at random or through contextual diversity. Meanwhile, the review team continues to draw from the highest-ranked documents, using the most recent ranking available. They continue until the review is complete.[3]
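
For the technically curious, here is a toy sketch of that train-rank-review loop. It is emphatically not Insight Predict's algorithm; the keyword-overlap "ranker" and the simulated reviewer are stand-ins I invented just to show the shape of the cycle.

```python
# Toy sketch of a continuous train-rank-review loop (not Insight Predict's algorithm).
# Documents are plain strings; the "ranker" scores unreviewed documents by word
# overlap with documents already judged relevant; the reviewer is simulated from
# a ground-truth table.

def rank(collection, judgments):
    """Rank unreviewed documents by overlap with words from docs judged relevant."""
    relevant_words = set()
    for doc_id, is_relevant in judgments.items():
        if is_relevant:
            relevant_words |= set(collection[doc_id].split())

    def score(doc_id):
        return len(relevant_words & set(collection[doc_id].split()))

    unreviewed = [d for d in collection if d not in judgments]
    return sorted(unreviewed, key=score, reverse=True)

def continuous_review(collection, truth, seed_ids, batch_size=2):
    judgments = {d: truth[d] for d in seed_ids}           # judgments on the seed docs
    while len(judgments) < len(collection):
        batch = rank(collection, judgments)[:batch_size]  # top of the latest ranking
        for doc_id in batch:
            judgments[doc_id] = truth[doc_id]             # simulated reviewer decision
        # A live review would stop here once a systematic sample showed the target
        # recall had been reached, rather than reviewing everything.
    return judgments

collection = {
    1: "merger price negotiation email",
    2: "fantasy football picks",
    3: "price fixing negotiation memo",
    4: "lunch menu for the office party",
    5: "negotiation strategy for the merger price",
}
truth = {1: True, 2: False, 3: True, 4: False, 5: True}   # ground-truth relevance
print(continuous_review(collection, truth, seed_ids=[1]))
```

In a real matter the ranking engine is far more sophisticated, and the loop stops when sampling shows the target recall has been reached rather than running through the whole collection.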

Does it work?

Logic tells us that continuously updated rankings will produce better results than a one-time process. As you add more training documents, the algorithm should improve. At least, that is the case with our system. While rankings based on a few thousand training documents can be quite good, they almost always improve through the addition of more training documents. As our Senior Research Scientist Jeremy Pickens says: “More is more.” And more is better.

And while more is better, it does not necessarily mean more work for the team. Our system’s ability to accept additional training documents, and to continually refine its rankings based on those additional exemplars, results in the review team having to review fewer documents, saving both time and money.

Testing the hypothesis

We decided to test our hypothesis using three different review projects. Because each had already gone through linear review, we had what Dr. Pickens calls “ground truth” about all of the records being ranked. Put another way, we already knew whether the documents were responsive or privileged (which were the goals of the different reviews).[4]

Thus, in this case we were not working with a partial sample or drawing conclusions based on a sample set. We could run the ranking process as if the documents had not been reviewed but then match up the results to the actual tags (responsive or privileged) given by the reviewers.

The process

The tests began by picking six documents at random from the total collection. We used those six documents as training seeds for an initial ranking of all the documents in the collection.[5]

From there, we simulated delivering new training documents to the reviewers. We included a mix of highly ranked and random documents, along with others selected for their contextual diversity (meaning they were different from anything previously selected for training). We used this technique to help ensure that the reviewers saw a diverse range of documents—hopefully improving the ranking results.

Our simulated reviewers made judgments on these new documents based on tags from the earlier linear review. We then submitted their judgments to the algorithm for further training and ranking. We continued this train-rank-review process, working in batches of 300, until we reached an appropriate recall threshold for the documents.

What do I mean by that? At each point during the iteration process, Insight Predict ranked the entire document population. Because we knew the true responsiveness of every document in the collection, we could easily track how far down in the ranking we would have to go to cover 50%, 60%, 70%, 80%, 90%, or even 95% of the relevant documents.
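
Here is a small sketch of that measurement, again with made-up data: given a ranking and the ground-truth labels, it reports how deep into the ranked list you would have to review to cover a given share of the relevant documents.

```python
# Sketch: how far down a ranked list you must review to reach a recall target.
def docs_needed_for_recall(ranked_labels, target):
    """ranked_labels holds ground-truth relevance (1/0) in ranked order."""
    total_relevant = sum(ranked_labels)
    needed = total_relevant * target
    found = 0
    for depth, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return depth
    return len(ranked_labels)

ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]        # hypothetical ranking, 5 relevant docs
print(docs_needed_for_recall(ranking, 0.80))    # 7 documents to find 4 of the 5
print(docs_needed_for_recall(ranking, 0.95))    # 10 documents (you need all 5)
```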

From there, we plotted the information to compare how many documents you would have to review using a one-time ranking process versus a continuous ranking approach. For clarity and simplicity, I chose two recall points to display: 80% (a common recall level) and 95% (high but achievable with our system). I could have presented several other recall rates as well but it might make the charts more confusing than necessary. The curves all looked similar in any event.

The research studies

Below are charts showing the results of our three case studies. These charts are different from the typical yield curves because they serve a different purpose. In this case, we were trying to demonstrate the efficacy of a continuous ranking process rather than a single ranking outcome.

Specifically, along the X-axis is the number of documents that were manually tagged and used as seeds for the process (the simulated review process). Along the Y-axis is the number of documents the review team would have to review (based on the seeds input to that point) to reach a desired recall level. The black diagonal line crossing the middle represents the simulated review counts, which were being continually fed back to the algorithm for additional training.

This will all make more sense when I walk you through the case studies. The facts of these cases are confidential, as are the clients and actual case names. But the results are highly interesting to say the least.

Research study 1: Wellington F matter (responsive review)

This case involved a review of 85,506 documents. Of those, 11,460 were judged responsive. That translates to a prevalence (richness) rate of about 13%. Here is the resulting chart from our simulated review:

[Chart: Wellington F matter, responsive review simulation]

There is a lot of information on this chart so I will take it step by step.

The black diagonal line represents the number of seeds given to our virtual reviewers. It starts at zero and continues along a linear path until it intersects the 95% recall line. After that, the line becomes dashed to reflect the documents that might be included in a linear review but would be skipped in a TAR 2.0 review.

The red line represents the number of documents the team would have to review to reach the 80% recall mark. By that I simply mean that after you reviewed that number of documents, you would have seen 80% of the relevant documents in the population. The counts (from the Y axis) range from a starting point of 85,506 documents at zero seeds (essentially a linear review)[6] to 27,488 documents (intersection with the black line) if you used continuous review.

I placed a grey dashed vertical line at the 2,500 document mark. This figure is meant to represent the number of training documents you might use to create a one-time ranking for a traditional TAR 1.0 process.[7] Some systems require a larger number of seeds for this process but the analysis is essentially the same.

Following the dashed grey line upwards, the review team using TAR 1.0 would have to review 60,161 documents to reach a recall rate of 80%. That number is lower than the 85,000+ documents that would be involved with a linear review. But it is still a lot of documents and many more than the 27,488 required using continuous ranking.

With continuous ranking, we would continue to feed training documents to the system and continually improve the yield curve. The additional seeds used in the ranking are represented by the black diagonal line as I described earlier. It continues upwards and to the right as more seeds are reviewed and then fed to the ranking system.

The key point is that the black solid line intersects the red 80% ranking curve at about 27,488 documents. At this point in the review, the review team would have seen 80% of the relevant documents in the collection. We know this is the case because we have the reviewer’s judgments on all of the documents. As I mentioned earlier, we treated those judgments as “ground truth” for this research study.[8]

What are the savings?

The savings come from the reduction of documents required to reach the 80% mark. By my calculations, the team would be able to reduce its review burden from 60,161 documents in the TAR 1.0 process to 27,488 documents in the TAR 2.0 process—a reduction of another 32,673 documents. That translates to an additional 38% reduction in review attributable to the continuous ranking process. That is not a bad result. If you figure $4 a document for review costs,[9] that would come to about $130,692 in additional savings.

It is worth mentioning that the total savings from the TAR process are even greater. If we can reduce the total document population to be reviewed from 85,506 to 27,488 documents, that represents a reduction of 58,018 documents, or about 68%. At $4 a document, the total savings from the TAR process comes to $232,072.

Time is Money: I would be missing the boat if I stopped the analysis here. We all know the old expression, “Time is money.” In this case, the time savings from continuous ranking over a one-time ranking can be just as important as the savings on review costs. If we assumed your reviewer could go through 50 documents an hour, the savings for 80% recall would be a whopping 653 hours of review time avoided. At eight hours per review day, that translates to 81 review days saved.[10]
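
For anyone who wants to check my math, here is the calculation behind those figures, using the $4-per-document and 50-documents-per-hour placeholders discussed in the notes.

```python
# Arithmetic behind the Wellington F figures at 80% recall.
TAR_1_DOCS = 60_161      # documents to review with a one-time (TAR 1.0) ranking
TAR_2_DOCS = 27_488      # documents to review with continuous (TAR 2.0) ranking
COST_PER_DOC = 4         # placeholder review cost, $ per document
DOCS_PER_HOUR = 50       # placeholder review speed
HOURS_PER_DAY = 8

docs_saved = TAR_1_DOCS - TAR_2_DOCS                    # 32,673 documents
dollars_saved = docs_saved * COST_PER_DOC               # $130,692
hours_saved = docs_saved / DOCS_PER_HOUR                # about 653 hours
days_saved = int(hours_saved // HOURS_PER_DAY)          # 81 eight-hour review days

print(f"Documents avoided: {docs_saved:,}")
print(f"Review cost saved: ${dollars_saved:,}")
print(f"Review time saved: {hours_saved:,.0f} hours ({days_saved} review days)")
```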

How about for 95% recall?

If you followed my description of the ranking curve for 80% recall, you can see how we would come out if our goal was to achieve 95% recall. I have placed a summary of the numbers in the chart but I will recap them here.

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 77,731 documents in order to reach the 95% recall point.
  2. With TAR 2.0’s continuous ranking, the review team could drop the count to 36,215 documents for a savings of 41,516 documents. That comes to a 49% savings.
  3. At $4 a document, the savings from using continuous ranking instead of TAR 1.0 would be $166,064. The total savings over linear review would be $197,164.
  4. Using our review metrics from above, this would amount to saving 830 review hours or 103 review days.

The bottom line on this case is that continuous ranking saves a substantial amount on both review costs and review time.

Research study 2: Ocala M matter (responsive review)

This case involved a review of 57,612 documents. Of those, 11,037 were judged relevant. That translates to a prevalence rate of about 19%, a bit higher than in the Wellington F Matter.

Here is the resulting chart from our simulated review:

[Chart: Ocala M matter, responsive review simulation]

For an 80% recall threshold, the numbers are these:

  1. Using TAR 1.0 with 2,500 seeds and the ranking at that point, the team would have to review 29,758 documents in order to reach the 80% recall point.
  2. With TAR 2.0 and continuous ranking, the review team could drop the count to 23,706 documents for a savings of 6,052 documents. That would be an 11% savings.
  3. At $4 a document, the savings from the continuous ranking process would be $24,208.

Compared to linear review, continuous ranking would reduce the number of documents to review by 33,906, for a cost savings of $135,624.

For a 95% recall objective, the numbers are these:

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 46,022 documents in order to reach the 95% recall point.
  2. With continuous ranking, the TAR 2.0 review team could drop the count to 31,506 documents for a savings of 14,516 documents. That comes to a 25% savings.
  3. At $4 a document, the savings from the continuous ranking process would be $58,064.

Not surprisingly, the numbers and percentages in the Ocala M study are different from the numbers in Wellington F, reflecting different documents and review issues. However, the underlying point is the same. Continuous ranking can save a substantial amount on review costs as well as review time.

Research study 3: Wellington F matter (privilege review)

The team on the Wellington F Matter also conducted a privilege review against the 85,000+ documents. We decided to see how the continuous ranking hypothesis would work for finding privileged documents. In this case, the collection was sparse. Of the 85,000+ documents, only 983 were judged to be privileged. That represents a prevalence rate of just over 1%, which is relatively low and can cause a problem for some systems.

Here is the resulting chart using the same methodology:

[Chart: Wellington F matter, privilege review simulation]

For an 80% recall threshold, the numbers are these:

  1. The TAR 1.0 training would have finished the process after 2,104 training seeds. The team would have hit the 80% recall point at that time.
  2. There would be no gain from continuous ranking in this case because the process would be complete during the initial training.

The upshot from this study is that the team would have saved substantially over traditional means of reviewing for privilege (which would involve linear review of some portion of the documents).[11] However, there were no demonstrable savings from continuous ranking.

I recognize that most attorneys would demand a higher threshold than 80% for a privilege review. For good reasons, they would not be comfortable with allowing 20% of the privileged documents to slip through the net. The 95% threshold might bring them more comfort.

For a 95% recall objective, the numbers are these:

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 18,736 documents in order to reach the 95% recall point.
  2. With continuous ranking, the TAR 2.0 review team could drop the count to 14,404 documents for a savings of 4,332 documents.
  3. At $4 a document, the savings from the continuous ranking process would be $17,328.

For actual privilege reviews, we recommend that our clients use many of the other analytics tools in Insight to make sure that confidential documents don’t fall through the net. Thus, for the documents that are not actually reviewed during the TAR 2.0 process, we would be using facets to check the names and organizations involved in the communications to help make sure there is no inadvertent production.

What about the subject matter experts?

In reading this, some of you may wonder what the role of a subject matter expert might be in a world of continuous ranking. Our answer is that the SME’s role is just as important as it was before but the work might be different. Instead of reviewing random documents at the beginning of the process, SMEs might be better advised to use their talents to find as many relevant documents as possible to help train the system. Then, as the review progresses, SMEs play a key role doing QC on reviewer judgments to make sure they are correct and consistent. Our research suggests that having experts review a portion of the documents tagged by the review team can lead to better ranking results at a much lower cost than having the SME review all of the training documents.

Ultimately, a continuous ranking process requires that the review team carry a large part of the training responsibility as they do their work. This sits well with most SMEs who don’t want to do standard review work even when it comes to relatively small training sets. Most senior lawyers that I know have no desire to review the large numbers of documents that would be required to achieve the benefits of continuous ranking. Rather, they typically want to review as few documents as possible. “Leave it to the review team,” I often hear. “That’s their job.”

Conclusion

As these three research studies demonstrate, continuous ranking can produce better results than the one-time ranking approach associated with traditional TAR. These cases suggest that potential savings can be as high as 49% over the one-time ranking process.

As you feed more seeds into the system, the system’s ability to identify responsive documents continues to improve, which makes sense. The result is that review teams are able to review far fewer documents than traditional methods require and achieve even higher rates of recall.

Traditional TAR systems give you one bite at the apple. But if you want to get down to the core, one bite won’t get you there. Continuous ranking lets each bite build on the last, so you finish your work more quickly and at lower cost. One bite at the apple is a lot better than none, but why stop there?

[Author's note: Thanks are due to Dr. Jeremy Pickens for the underlying work that led to this article, along with his patient help trying to explain these concepts to a dumb lawyer—namely me. Thanks also to Dr. William Webber for his extended comments on mistakes and misperceptions in my draft. Any mistakes remaining are mine alone. Further thanks to Ron Tienzo, who caught a lot of simple mistakes through careful proofing, and to Bob Ambrogi, a great editor and writer.]


[1] Relevant in this case means relevant to the issues under review. TAR systems are often used to find responsive documents, but they can also be used for other inquiries, such as finding privileged documents, hot documents or documents relevant to a particular issue.

[2] Our contextual diversity algorithm is designed to find documents that are different from those already seen and used for training. We use this method to ensure that we aren’t missing documents that are relevant but different from the mainstream of documents being reviewed.

[3] Determining when the review is complete is a subject for another day. Suffice it to say that once you determine the appropriate level of recall for a particular review, it is relatively easy to sample the ranked documents to determine when that recall threshold has been met.

[4] We make no claim that a test of three cases is anything more than a start of a larger analysis. We didn’t hand pick the cases for their results but would readily concede that more case studies would be required before you could draw a statistical conclusion. We wanted to report on what we could learn from these experiments and invite others to do the same.

[5] Our system ranks all of the documents each time we rank. We do not work off a reference set (i.e. a small sample of the documents).

[6] We recognize that IR scientists would argue that you only need to review 80% of the total population to reach 80% recall in a linear review. We could use this figure in our analysis but chose not to simply because the author has never seen a linear review that stopped before all of the documents were reviewed—at least based on an argument that they had achieved a certain recall level as a result of reaching a certain threshold. Clearly you can make this argument and are free to do so. Simply adjust the figures accordingly.

[7] This isn’t a fair comparison. We don’t have access to other TAR systems to see what results they might have after ingesting 2,500 seed documents. Nor can we simulate the process they might use to select those seeds for the best possible ranking results. But it is the data I have to work with. The gap between one-time and continuous ranking may be narrower but I believe the essential point is the same. Continuous ranking is like continuous learning: the more of it the better.

[8] In a typical review, the team would not know they were at the 80% mark without testing the document population. We know in this case because we have all the review judgments. In the real world, we recommend the use of a systematic sample to determine when target recall is being approached by the review.

[9] I chose this figure as a placeholder for the analysis. We have seen higher and lower figures depending on who is doing the review. Feel free to use a different figure to reflect your actual review costs.

[10] I used 50 documents per hour as a placeholder for this calculation. Feel free to substitute different figures based on your experience. But saving on review costs is only half the benefit of a TAR process.

[11] Most privilege reviews are not linear in the sense that all documents in a population are reviewed. Typically, some combination of searches is run to identify the likely privileged candidates. That number should be smaller than the total but can’t be specified in this exercise.