[This article originally appeared in the Winter 2014 issue of EDDE Journal, a publication of the E-Discovery and Digital Evidence Committee of the ABA Section of Science and Technology Law.]
Although still relatively new, technology-assisted review (TAR) has become a game changer for electronic discovery. This is no surprise. With digital content exploding at unimagined rates, the cost of review has skyrocketed, now accounting for over 70% of discovery costs. In this environment, a process that promises to cut review costs is sure to draw interest, as TAR, indeed, has.
Called by various names—including predictive coding, predictive ranking, and computer-assisted review—TAR has become a central consideration for clients facing large-scale document review. It originally gained favor for use in pre-production reviews, providing a statistical basis to cut review time by half or more. It gained further momentum in 2012, when federal and state courts first recognized the legal validity of the process.
More recently, lawyers have realized that TAR also has great value for purposes other than preparing a production. For one, it can help you quickly find the most relevant documents in productions you receive from an opposing party. TAR can be useful for early case assessment, for regulatory investigations and even in situations where your goal is only to speed up the production process through prioritized review. In each case, TAR has proven to save time and money, often in substantial amounts.
But what about non-English language documents? For TAR to be useful in international litigation, it needs to work for languages other than English. Although English is used widely around the world, it is not the only language you will see if you get involved in multi-national litigation, arbitration or regulatory investigations. Chinese, Japanese and Korean will be common for Asian transactions; German, French, Spanish, Russian, Arabic and Hebrew will be found for matters involving European or Middle Eastern nations. Will TAR work for documents in these languages?
Many industry professionals doubted that TAR would work on non-English documents. They reasoned that the TAR process was about “understanding” the meaning of documents. It followed that unless the system could understand the documents—and presumably computers could understand only English—the process wouldn’t be effective.
The doubters were wrong. Computers don’t actually understand documents; they simply catalog the words in documents. More accurately, we call what they recognize “tokens,” because often the fragments (numbers, misspellings, acronyms and simple gibberish) are not even words. The question, then, is whether computers can recognize tokens (words or otherwise) when they appear in other languages.
The simple answer is yes. If the documents are processed properly, TAR can be just as effective for non-English as it is for English documents. After a brief introduction to TAR and how it works, I will show you how this can be the case. We will close with a case study using TAR for Japanese documents.
What is TAR?
TAR is a process through which one or more humans interact with a computer to train it to find relevant documents. Just as there are many names for the process, there are many variations of it. For simplicity’s sake, I will use Magistrate Judge Andrew J. Peck’s definition in Da Silva Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012), the first case to approve TAR as a method to shape document review:
By computer assisted review, I mean tools that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with a human reviewer.
It is about as simple as that:
- A human (subject matter expert, often a lawyer) sits down at a computer and looks at a subset of documents.
- For each, the lawyer records a thumbs-up or thumbs-down decision (tagging the document). The TAR algorithm watches carefully, learning during this training.
- When the training session is complete, we let the system rank and divide the full set of documents between (predicted) relevant and irrelevant.
- We then review the relevant documents, ignoring the rest.
The benefits from this process are easy to see. Let’s say you started with a million documents that otherwise would have to be reviewed by your team. If the computer algorithm predicted with the requisite degree of confidence that 700,000 are likely not-relevant, you could then exclude them from the review for a huge savings in review costs. That is a great result, particularly if you are the one paying the bills. At four dollars a document for review (to pick a figure), you just saved $2.8 million. And the courts say this is permissible.
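The arithmetic behind that estimate is simple enough to sketch. The figures below are the hypothetical ones from the example above, not data from any actual matter:

```python
# Hypothetical figures from the example above.
total_docs = 1_000_000          # documents collected
predicted_irrelevant = 700_000  # excluded by the TAR ranking
cost_per_doc = 4.00             # assumed review cost per document

docs_to_review = total_docs - predicted_irrelevant
savings = predicted_irrelevant * cost_per_doc

print(f"Documents left to review: {docs_to_review:,}")  # 300,000
print(f"Review cost avoided: ${savings:,.0f}")          # $2,800,000
```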
How is TAR Used?
TAR can be used for several purposes. The classic use is to prioritize the review process, typically in anticipation of an outgoing production. You use TAR to sort the documents in order of likely relevance. The reviewers do their work in that order, presumably reviewing the most likely relevant ones first. When they get to a point where the number of relevant documents drops significantly, suggesting that they have seen most of them, the review stops. Somebody then samples the unreviewed documents to confirm that the number of relevant documents remaining is sufficiently low to justify discontinuing further, often expensive, review.
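The confirmation step at the end of that workflow is just random sampling. Here is a minimal sketch, assuming a hypothetical `judge` function that stands in for a human reviewer's call on each sampled document; the population and sample size are illustrative only:

```python
import random

# Sketch of the confirmation step described above: draw a random sample
# from the unreviewed documents and estimate how many relevant ones remain.
def estimate_remaining(unreviewed_ids, judge, sample_size=400, seed=42):
    rng = random.Random(seed)
    sample = rng.sample(unreviewed_ids, sample_size)
    relevant_in_sample = sum(1 for doc_id in sample if judge(doc_id))
    rate = relevant_in_sample / sample_size
    # Extrapolate the sample rate to the whole unreviewed population.
    return rate, rate * len(unreviewed_ids)

# Toy run: pretend 2% of 50,000 unreviewed documents are relevant.
ids = list(range(50_000))
rate, remaining = estimate_remaining(ids, judge=lambda i: i % 50 == 0)
print(f"Sample relevance rate: {rate:.1%}")
```

If the estimated rate is low enough, the parties can justify stopping the review there.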
We can see the benefits of a TAR process through the following chart, which is known as a yield curve:
A yield curve presents the results of a ranking process and is a handy way to visualize the difference between two processes. The X axis shows the percentage of documents that are available for review. The Y axis shows the percentage of relevant documents found at each point in the review.
As a baseline, I created a gray diagonal line to show the progress of a linear review (which essentially moves through the documents in random order). Without a better means for ordering the documents by relevance, the recall rates for a linear review typically match the percentage of documents actually reviewed; hence the straight line. By the time you have seen 80% of the documents, you probably have seen 80% of the relevant documents.
The blue line shows the progress of a TAR review. Because the documents are ranked in order of likely relevance, you see more relevant documents at the front end of your review. Following the blue line up the Y axis, you can see that you would reach 50% recall (have viewed 50% of the relevant documents) after about 5% of your review. You would have seen 80% of the relevant documents after reviewing just 10% of the total review population.
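The yield curve is easy to compute from a ranked review: at each point, divide the relevant documents seen so far by the total number of relevant documents. A minimal sketch, using a hypothetical list that records whether each document, in ranked order, turned out to be relevant:

```python
# Compute a yield (recall) curve from a ranked review. Each entry of
# `ranked_relevance` is True if the i-th ranked document was relevant.
def recall_curve(ranked_relevance):
    total_relevant = sum(ranked_relevance)
    found = 0
    curve = []
    for i, is_relevant in enumerate(ranked_relevance, start=1):
        found += is_relevant
        curve.append((i / len(ranked_relevance),   # fraction of docs reviewed
                      found / total_relevant))     # fraction of relevant found
    return curve

# Toy example: 10 documents, 4 relevant, ranked well by the engine.
curve = recall_curve([True, True, False, True, True,
                      False, False, False, False, False])
# After reviewing 50% of the documents, 100% of the relevant ones are found.
print(curve[4])  # (0.5, 1.0)
```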
This is a big deal. If you use TAR to organize your review, you can dramatically improve the speed at which you find relevant documents over a linear review process. Assuming the judge will let you stop your review after you find 80% of the relevant documents (and some courts have indicated this is a valid stopping point), review savings can be substantial.
You can also use this process for other purposes. Analyzing inbound productions is one good example. These are often received shortly before depositions begin. If you receive a million or so documents in a production, how are you to quickly find which ones are important and which are not?
Here is an example where counsel reviewed about 200,000 documents received not long before depositions commenced and found about 5,700 that were “hot.” Using a small set of their own judgments about the documents for training, we were able to demonstrate that they would have found the same number of hot documents after reviewing only 38,000 documents. They could have stopped there and avoided the costs of reviewing the remaining 162,000 documents.
You can also use this process for early case assessment, using the ranking engine to place a higher number of relevant documents at the front of the stack.
What about non-English Documents?
To understand why TAR can work with non-English documents, you need to know two basic points:
- TAR doesn’t understand English or any other language. It uses an algorithm to associate words with relevant or irrelevant documents.
- To use the process for non-English documents, particularly those in Chinese and Japanese, the system has to first tokenize the document text so it can identify individual words.
We will hit these topics in order.
1. TAR Doesn’t Understand English
It is beyond the province of this article to provide a detailed explanation of how TAR works, but a basic explanation will suffice for our purposes. Let me start with this: TAR doesn’t understand English or the actual meaning of documents. Rather, it simply analyzes words algorithmically according to their frequency in relevant documents compared to their frequency in irrelevant documents.
Think of it this way. We train the system by marking documents as relevant or irrelevant. When I mark a document relevant, the computer algorithm analyzes the words in that document and ranks them based on frequency, proximity or some other such basis. When I mark a document irrelevant, the algorithm does the same, this time giving the words a negative score. At the end of the training process, the computer sums up the analysis from the individual training documents and uses that information to build a search against a larger set of documents.
While different algorithms work differently, think of the TAR system as creating huge searches using the words developed during training. It might use 10,000 positive terms, with each ranked for importance. It might similarly use 10,000 negative terms, with each ranked in a similar way. The search results would come up in an ordered fashion sorted by importance, with the most likely relevant ones coming first.
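The idea can be sketched in a few lines of code. This is a deliberately simplified illustration of weighted-term scoring, not any vendor's actual algorithm; the training documents are invented for the example:

```python
from collections import Counter

# Learn per-token weights from tagged training documents, then score new
# documents by summing the weights of the tokens they contain.
def learn_weights(relevant_docs, irrelevant_docs):
    rel = Counter(tok for doc in relevant_docs for tok in doc.split())
    irr = Counter(tok for doc in irrelevant_docs for tok in doc.split())
    vocab = set(rel) | set(irr)
    # Positive weight: seen more in relevant training docs; negative otherwise.
    return {tok: rel[tok] - irr[tok] for tok in vocab}

def score(doc, weights):
    return sum(weights.get(tok, 0) for tok in doc.split())

weights = learn_weights(
    relevant_docs=["merger price negotiation", "price agreement draft"],
    irrelevant_docs=["lunch menu", "office party draft"],
)
ranked = sorted(["price negotiation notes", "party menu"],
                key=lambda d: score(d, weights), reverse=True)
print(ranked[0])  # "price negotiation notes"
```

Real engines use far more sophisticated statistics, but the core operation is the same: counting which tokens appear in which tagged documents.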
None of this requires that the computer know English or the meaning of the documents or even the words in them. All the computer needs to know is which words are contained in which documents.
2. If Documents are Properly Tokenized, the TAR Process Will Work.
Tokenization may be an unfamiliar term to you but it is not difficult to understand. When a computer processes documents for search, it pulls out all of the words and places them in a combined index. When you run a search, the computer doesn’t go through all of your documents one by one. Rather, it goes to an ordered index of terms to find out which documents contain which terms. That’s why search works so quickly. Even Google works this way, using huge indexes of words.
As I mentioned, however, the computer doesn’t understand words or even that a word is a word. Rather, for English documents it identifies a word as a series of characters separated by spaces or punctuation marks. Thus, it recognizes the words in this sentence because each has a space (or a comma) before and after it. Because not every group of characters is necessarily an actual “word,” information retrieval scientists call these groupings “tokens,” and the act of identifying these tokens for the index as “tokenization.”
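A toy version of tokenization and an inverted index makes the mechanism concrete. This sketch splits English text on non-alphanumeric characters, which is the convention described above:

```python
import re
from collections import defaultdict

# Tokenize English text by splitting on anything that isn't a letter or
# digit, then build an inverted index mapping each token to the set of
# documents that contain it. Search consults the index, never the documents.
def tokenize(text):
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

index = build_index({
    1: "The contract was signed in 2014.",
    2: "Draft contract, not signed.",
})
print(sorted(index["contract"]))  # [1, 2]
print(sorted(index["2014"]))      # [1]
```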
All of these are tokens: ordinary words (“contract”), numbers (“2014”), acronyms (“ESI”), misspellings (“recieve”), and even plain gibberish (“asdfgh”). All of these will be kept in a token index for fast search and retrieval.
Certain languages, such as Chinese and Japanese, don’t delineate words with spaces or western punctuation. Rather, their characters run through the line break, often with no breaks at all. It is up to the reader to tokenize the sentences in order to understand their meaning.
Many early English-language search systems couldn’t tokenize Asian text, resulting in search results that often were less than desirable. More advanced search systems, like the one we chose for Catalyst, had special tokenization engines which were designed to index these Asian languages and many others that don’t follow the Western conventions. They provided more accurate search results than did their less-advanced counterparts.
Similarly, the first TAR systems were focused on English-language documents and could not process Asian text. At Catalyst, we added a text tokenizer to make sure that we handled these languages properly. As a result, our TAR system can analyze Chinese and Japanese documents just as if they were in English. Word frequency counts are just as effective for these documents and the resulting rankings are as effective as well.
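One common baseline for splitting unspaced Japanese or Chinese text is dictionary-based longest-match segmentation. The toy sketch below is illustrative only; it is not Catalyst's tokenizer, and its tiny dictionary and sample string are invented. Production segmenters rely on large dictionaries and statistical models:

```python
# Toy dictionary-based longest-match segmenter for unspaced CJK text.
# At each position, take the longest substring found in the dictionary;
# fall back to a single character when nothing matches.
def segment(text, dictionary, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# "契約書" = contract document, "署名" = signature
dictionary = {"契約", "書", "契約書", "署名"}
print(segment("契約書署名", dictionary))  # ['契約書', '署名']
```

Once text has been segmented this way, the resulting tokens feed the same indexing and ranking machinery used for English.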
A Case Study to Prove the Point.
Let me illustrate this with an example from a matter we handled not long ago. We were contacted by a major U.S. law firm that was facing review of a set of mixed Japanese and English language documents. It wanted to use TAR on the Japanese documents, with the goal of cutting both the cost and time of the review, but was uncertain whether TAR would work with Japanese.
Our solution to this problem was to first tokenize the Japanese documents before beginning the TAR process. Our method of tokenization—also called segmentation—extracts the Japanese text and then uses language-identification software to break it into words and phrases that the TAR engine can identify.
To achieve this, we loaded the Japanese documents into our review platform. As we loaded the documents, we performed language detection and extracted the Japanese text. Then, using our proprietary technology and methods, we tokenized the text so the system would be able to analyze the Japanese words and phrases.
With tokenization complete, we could begin the TAR process. In this case, senior lawyers from the firm reviewed 500 documents to create a reference set to be used by the system for its analysis. Next, they reviewed a sample set of 600 documents, marking them relevant or non-relevant. These documents were then used to train the system so it could distinguish between likely relevant and likely non-relevant documents and use that information for ranking.
After the initial review, and based on the training set, we directed the system to rank the remainder of the documents for relevance. The results were compelling:
- The system was able to identify a high percentage of likely relevant documents (98%) and place them at the front of the review queue through its ranking process. As a result, the review team would need to review only about half of the total document population (48%) to cover the bulk of the likely relevant documents.
- The remaining portion of the documents (52%) contained a small percentage of likely relevant documents. The review team reviewed a random sample from this portion and found only 3% were likely relevant. This low percentage suggested that these documents did not need to be reviewed, thus saving the cost of reviewing over half the documents.
By applying tokenization before beginning the TAR process, the law firm was able to target its review toward the most-likely relevant documents and to reduce the total number of documents that needed to be reviewed or translated by more than half.
As corporations grow increasingly global, legal matters are increasingly likely to involve non-English language documents. Many believed that TAR was not up to the task of analyzing non-English documents. The truth, however, is that with the proper technology and expertise, TAR can be used with any language, even difficult Asian languages such as Chinese and Japanese.
Whether for English or non-English documents, the benefits of TAR are the same. By using computer algorithms to rank documents by relevance, lawyers can review the most important documents first, review far fewer documents overall, and ultimately cut both the cost and time of review. In the end, that is something their clients will understand, no matter what language they speak.