Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices
John Tredennick

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.
Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991), both ABA best sellers focusing on the use of computers in litigation. At the same time, he wrote How to Prepare for Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer's Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer's Guide to Microsoft Excel 2007 (ABA Press 2009).

John has been widely honored for his achievements. In 2013, he was named by the American Lawyer as one of the top six “E-Discovery Trailblazers” in its special issue on the “Top Fifty Big Law Innovators” of the past fifty years. In 2012, he was named to the FastCase 50, which recognizes the smartest, most courageous innovators, techies, visionaries and leaders in the law. London's CityTech magazine named him one of the “Top 100 Global Technology Leaders.” In 2009, he was named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region. Also in 2009, he was named the Top Technology Entrepreneur by the Colorado Software and Internet Association.

John is the former chair of the ABA's Law Practice Management Section. For many years, he was editor-in-chief of the ABA's Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on the subject to audiences on four continents. In his spare time, you will find him competing on the national equestrian show jumping circuit.

The Five Myths of Technology Assisted Review, Revisited

On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process and generated a lot of interest, and a bit of dyspepsia too. At the least, it got some good discussions going and perhaps nudged the status quo a bit.

One writer, Roe Frazer, took issue with our views in a blog post he wrote. Apparently, he tried to post his comments with Law Technology News but was unsuccessful. Instead, he posted his reaction on the blog of his company, Cicayda. We would have responded there but we don’t see a spot for replies on that blog either.

We love comments like these and the discussion that follows. This post offers our thoughts on the points raised by Mr. Frazer and we welcome replies right here for anyone interested in adding to the debate. TAR 1.0 is a challenging-enough topic to understand. When you start pushing the limits into TAR 2.0, it gets really interesting. In any event, you can’t move the industry forward without spirited debate. The more the merrier.

We will do our best to summarize Mr. Frazer’s comments and offer our responses.

1. Only One Bite at the Apple?

Mr. Frazer suggests we were “just a bit off target” on the nature of our criticism. He rightly points out that litigation is an iterative (“circular” he calls it) business.

When new information comes into a case through initial discovery, with TAR/PC you must go back and re-train the system. If a new claim or new party gets added, then a document previously coded one way may have a completely different meaning and level of importance in light of the way the data facts changed. This is even more so the case with testimony, new rounds of productions, non-party documents, heck even social media, or public databases. If this happens multiple times, you wind up reviewing a ton of documents to have any confidence in the system. Results are suspect at best. Cost savings are gone. Time is wasted. Attorneys, entrusted with actually litigating the case, do not and should not trust it, and thus smartly review even more documents on their own at high pay rates. I fail to see the value of “continuous learning”, or why this is better. It cannot be.

He might be missing our point here. Certainly he is correct when he says that more training is always needed when new issues arise, or when new documents are added to the collection. And there are different ways of doing that additional training, some of which are smarter than others. But that is the purview of Myth #4, so we’ll address it below. Let us, therefore, clarify that when we’re talking about “only one bite of the apple,” we’re talking about what happens when the collection is static and no new issues are added.

To give a little background, let us explain what we understand to be the current gold-standard TAR workflow, to which we are reacting. The industry in general says TAR works like this: you get hold of the most senior, most experienced, expertise-laden individual you can find, sit that person down in front of an active learning algorithm, and have him or her iteratively judge thousands of documents until the system “stabilizes.” Then you apply the results of that learning to your entire collection and batch out the top documents to your contract review team for final proofing. At the point you do that batching, says the industry, learning is complete, finito, over, done. Even if you trust your contract review team to judge batched-out documents, none of those judgments are ever fed back into the system to be used for further training to improve the ranking.

Myth #1 says that it doesn’t have to be that way. What “continuous learning” means is that all judgments during the review should get fed back into the core algorithm to improve the quality with regard to any and all documents that have not yet received human attention. And the reason why it is better? Empirically, we’ve seen it to be better. We’ve done experiments in which we’ve trained an algorithm to “stability,” and then we’ve continued training even during the batched-out review phase – and seen that the total number of documents that need to be examined until a defensible threshold is hit continues to go down. Is there value in being able to save even more on review costs? We think that there is.
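To make the idea concrete, here is a minimal sketch of a continuous learning loop in Python using scikit-learn. It is not Catalyst's proprietary algorithm, and the function and parameter names (continuous_review, get_reviewer_labels, batch_size) are our own illustrative choices; the point is simply that every batch of reviewer judgments flows back into the model before the next batch is ranked and released for review.

```python
# A minimal sketch of continuous learning (not Catalyst's algorithm):
# every batch of reviewer judgments is fed back into the model before
# the next batch is ranked and released for review.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_review(docs, seed_labels, get_reviewer_labels, batch_size=200):
    """docs: list of document texts; seed_labels: {doc_index: 0 or 1};
    get_reviewer_labels: callback returning {doc_index: 0 or 1} for a batch."""
    vectors = TfidfVectorizer().fit_transform(docs)
    labels = dict(seed_labels)                      # every human judgment so far
    while len(labels) < len(docs):
        model = LogisticRegression(max_iter=1000)
        train_idx = list(labels)
        model.fit(vectors[train_idx], [labels[i] for i in train_idx])

        # Rank only the documents no human has looked at yet.
        unseen = [i for i in range(len(docs)) if i not in labels]
        scores = model.predict_proba(vectors[unseen])[:, 1]
        ranked = [i for _, i in sorted(zip(scores, unseen), reverse=True)]

        # Batch the top-ranked documents to the review team, then feed
        # their codings straight back into the training set.
        batch = ranked[:batch_size]
        labels.update(get_reviewer_labels(batch))
    return labels
```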

You can see some of the results of our testing on the benefits of continuous learning here.

2. Are Subject Matter Experts Required?

We understand that this is a controversial issue and that it will take time before people become comfortable with this new approach. To quote Mr. Frazer:

To the contrary, using a subject matter expert is critical to the success of litigation – that is a big reason AmLaw 200 firms get hired. Critical thinking and strategy by a human lawyer is essential to a well-designed discovery plan. The expertise leads to better training decisions and better defensibility at the outset. I thus find your discussion of human fallibility and review teams puzzling.

Document review is mind numbing and people are inconsistent in tagging which is one of the reasons for having the expert in the first place. With a subject matter expert, you are limiting the amount of fallible humans in the process. We have seen many “review lawyers” and we have yet to find one who does not need direction by a subject matter expert. One of the marketing justifications for using TAR/PC is that human review teams are average to poor at finding relevant documents – it must be worse without a subject matter expert. I do agree with your statement that “most senior attorneys… feel they have better things to do than TAR training in any event.” With this truth, you have recognized the problem with the whole system: Spend $100k+ on a review process, eat up a large portion of the client’s litigation budget, yet the expert litigation team who they hired has not looked at a single document, while review attorneys have been “training” the system? Not relying on an expert seems to contradict your point  3, ergo.

Again, the nature of this response indicates that you are approaching this from the standard TAR workflow, which is to have your most senior expert sit for a number of days and train to stability, and then never have the machine learn anything again. To dispel the notion that this workflow is the only way in which TAR can or even should work is one reason we’re introducing these myths in the first place. What we are saying in our Myth #2 is not that you would never have senior attorneys or subject matter experts involved in any way. Of course that person should train the contract reviewers.  Rather, we are saying that you can involve non-experts, non-senior attorneys in the actual training of the system and achieve results that are just as good as having *only* a senior attorney sit and train the system.  And our method dramatically lowers both your total time cost and your total monetary cost in the process.

For example, imagine a workflow in which your contract reviewers, rather than your senior attorney, do all the initial training on those thousands of documents. Then, at some point later in the process, your senior attorney steps in and re-judges a small fraction of the existing training documents. He or she corrects via the assistance of a smart algorithm only the most egregious, inconsistent training outliers and then resubmits for a final ranking. We’ve tested this workflow empirically, and found that it yields results that are just as good, if not better, than the senior attorney working alone, training every single document. (See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?)

Moreover, you can get through training more quickly, because now you have a team working in parallel rather than an individual working serially. Add to that the fact that your senior attorney does not always have free time the moment training needs to get done, plus the flexibility to bring that attorney in at a later point to do a tenth of the work he or she would otherwise have to do, and you have a recipe for success. That’s what this myth is about – the notion, which the rest of the industry holds and your own response reflects, that unless a senior attorney performs every action that in any way affects the training of the system, the result is a recipe for disaster. It is not; that is a myth.

And again, we justify this not through appeals to authority (“that is a big reason AmLaw 200 firms get hired”), but through empirical methods. We’ve tested it out extensively. But if appeals to authority are what is needed to show that the algorithms we employ are capable of successfully supporting these alternative workflows, we can do so. Our in-house senior research scientist, Jeremy Pickens, has his PhD from one of the top two information retrieval research labs in the country, and not only holds numerous patents on the topic, but has received the best paper award at the top information retrieval conference in the world (ACM SIGIR). Blah blah blah. But we’d prefer not to have to resort to appeals to authority, because empirical validation is so much more effective.

Please note also that we in no way *force* you to use non-senior attorneys during the training process. You are of course free to work however you want to work. However, should time or money be an issue, we’ve designed our system to let you work successfully and more efficiently without requiring senior attorneys or experts to do all of your training.

You can see the results of our research on the use of subject matter experts here and here.

3. Must I Train on Randomly Selected Documents?

We pointed out in our article that it is a myth that TAR training can only be on random documents.

You totally glossed over bias. Every true scientific study says that judgmental sampling is fraught with bias. Research into sampling in other disciplines is that results from judgmental sampling should be accepted with extreme caution at best. It is probably even worse in litigation where “winning” is the driving force and speed is omnipresent. Yet, btw, those who advocate judgmental sampling in eDiscovery, Grossman, e.g., also advocate that the subject matter experts select the documents – this contradicts your points in 2. You make a true point about the richness of the population making it difficult to find documents, but this militates against random selection, not for it. To us this shows another reason why TAR/PC is broken. Indeed “clicking through thousands of random documents is boring” – but this begs the question. It was never fun reviewing a warehouse of banker’s documents either. But it is real darn fun when you find the one hidden document that ties everything together, and wins your case. What is boring or not fun has nothing to do with the quality of results in a civil case or criminal investigation.

I hope we have managed to clarify that Myth #2 is not actually saying that you never have to involve a senior attorney in any way, shape or form. Rather we believe that a senior attorney doesn’t have to do every single piece of the TAR training, in all forms, at all times. Once you understand this, you quickly realize that there is no contradiction between what Maura Grossman is saying and what we are saying.

If you want to do judgmental sampling, let your senior attorney and all of his or her wisdom be employed in creating the search queries used to find interesting documents. But instead of requiring that senior person to then look at every single result of those queries, let your contract reviewers comb through them. In that manner, you involve your senior attorney where his or her skills are the most valuable and where his or her time is the most precious. It takes a lot less time to issue a few queries than it does to sit and judge thousands of documents. Are we the only vendor out there aware that the person who issues the searches and the person who judges all the found documents don’t have to be the same person? We would hope not, but perhaps we are.

Now, to the issue of bias. You’re quite right to be concerned about this, and we fault the necessary brevity of our original article in not being able to go into enough detail to satisfy your valid concerns. So we would recommend reading the following article, as it goes into much more depth about how bias is overcome when you start judgmentally, and it backs up its explanations empirically: Predictive Ranking: Technology-Assisted Review Designed for the Real World.

Imagine your TAR algorithm as a seesaw. That seesaw has to be balanced, right? So you have many in the industry saying that the only way to balance it is to randomly select documents along the length of that seesaw. In that manner, you’ll have approximately the same number of docs, at the same distance from the center, on both sides of the seesaw, and the seesaw will therefore be balanced. Judgmental sampling, on the other hand, is like plopping someone down on the far end of the seesaw. That entire side sinks down and raises the other side high into the air, throwing off the balance. In that case, the best way to balance the seesaw again is to plop down an equal weight on the exact opposite end, bringing the entire system back to equilibrium.

What we’ve designed in the Catalyst system is an algorithm that we call “contextual diversity.” “Contextual” refers to where things have already been plopped down on that seesaw. The “diversity” means “that area of the collection that is most about the things that you know that you know nothing about,” i.e. that exact opposite end of the seesaw, rather than some arbitrary, random point. Catalyst’s contextual diversity algorithm explicitly finds and models those balancing points, and surfaces those to your human judge(s) for coding. In this manner, you can both start judgmentally *and* overcome bias. We apologize that this was not as clear in the original 5 Myths article, but we hope that this explanation helps.
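To illustrate the general idea only (the actual contextual diversity algorithm is proprietary and differs in its details), a simple diversity selection can be sketched like this: among the unjudged documents, surface the ones least similar to anything that has already been judged, i.e. the far end of the seesaw. The function name diversity_picks and its parameters are hypothetical.

```python
# Illustrative diversity selection (not Catalyst's contextual diversity
# algorithm): choose the unjudged documents least similar to anything
# already judged, i.e. the "far end of the seesaw."
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def diversity_picks(docs, judged_idx, k=10):
    vectors = TfidfVectorizer().fit_transform(docs)
    unjudged_idx = [i for i in range(len(docs)) if i not in set(judged_idx)]

    # For each unjudged document, record its similarity to the closest
    # judged document; low values mean "we know nothing about this area."
    sims = cosine_similarity(vectors[unjudged_idx], vectors[list(judged_idx)])
    closest = sims.max(axis=1)

    # Return the k documents we know the least about.
    order = np.argsort(closest)
    return [unjudged_idx[i] for i in order[:k]]
```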

We go into this subject in more detail here.

4. You Can’t Start Training until You Have All of Your Documents

One of the toughest issues in TAR systems is the requirement that you collect all of your documents before you start training. This limitation stems from the use of a randomly selected control set to both guide training and provide defensibility. If you add new documents to the mix (rolling uploads), they will not be represented in the control set. Thus, even if you continue training with some of these new documents, your control set becomes invalid and you lose defensibility.

You might have missed that point in your comments:

I think this is similar to #1 in that you are not recognizing the true criticism that things change too much in litigation. While you can start training whenever you want and there are algorithms that will allow you to conduct new rounds on top of old rounds – the real problem is that you must go back and change previous coding decisions because the nature of the case has changed. To me, this is more akin to “continuous nonproductivity” than “continuous learning.”

The way in which we natively handle rolling uploads from a defensibility standpoint is to not rely on a human-judged control set. There are other intelligent metrics we use to monitor the progress of training, so we do not abandon the need for reference, or our defensibility, altogether – just the need for expensive, human-judged reference.

The way other systems have to work, in order to keep their control set valid, is to judge another statistically valid sample of documents from the newly arrived set. And in our experience, in the cases we’ve dealt with over the past five years, there have been on average around 67 separate uploads until the collection was complete. Let’s be conservative and assume you’re dealing with a third of that – say only 20 separate uploads from start to finish. As each new upload arrives, you’re judging 500 randomly selected documents just to create a control set. 500 * 20 = 10,000. And let’s suppose your senior attorney gets through 50 documents an hour. That’s 200 hours of work just to create a defensible control set, with not even a single training document yet judged.  And since you’ve already stated that you need to hire an AmLaw 200 senior attorney to judge these documents, at $400/hour that would be $80,000. Our approach saves you that money right off the bat by being able to natively handle the control set/rolling upload issue. Plug in your own numbers if you don’t like these, but our guess is that it’ll still add up to a significant savings.
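The arithmetic behind that estimate is simple enough to check, and to re-run with your own assumptions:

```python
# Rough cost of re-validating a control set on every rolling upload,
# using the illustrative figures from the text; substitute your own.
uploads = 20              # separate rolling uploads
control_per_upload = 500  # randomly selected control documents per upload
docs_per_hour = 50        # senior attorney review rate
rate = 400                # senior attorney hourly rate, in dollars

control_docs = uploads * control_per_upload        # 10,000 documents
hours = control_docs / docs_per_hour               # 200 hours
cost = hours * rate                                # $80,000

print(f"{control_docs} control documents, {hours:.0f} hours, ${cost:,.0f}")
```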

But the control set is only half of the story. The other half is the training itself. Let us distinguish if we may between an issue that changes, and a collection that changes. If it is your issue itself (i.e. your definition of responsiveness) that changes when new documents are collected, then certainly nothing we’ve explicitly said in these Five Myths will address that problem. However, if all you are talking about is the changing expression of an unchanging issue, then we are good to go.

What do we mean by the changing expression of an unchanging issue? We mean that if you’ve collected from your engineering custodians first, and started to train the system on those documents, and then suddenly a bunch of marketing custodians arrive, that doesn’t actually change the issue that you’re looking for. What responsiveness was about before is still what responsiveness is about now. However, how that issue is expressed will change. The language that the marketers use is very different than the language that the engineers use, even if they’re talking about the same responsive “aboutness.”

This is exactly why training is a problem for the standard TAR 1.0 workflow. If you’re working in a way that requires your expert to judge all the documents up front, then when the collection grows (by adding the marketing documents to the engineering collection), that expert’s work is not really applicable to the new information. You have to go back to the drawing board, select another set of random documents to avoid bias, feed those yet again to a busy, time-pressed expert, and so on. That is extremely inefficient.

What we do with our continuous learning is once again employ that “contextual diversity” algorithm that we mentioned above. Let us return to the seesaw analogy. Imagine that you’ve got your seesaw, and through the training that you’ve done it is now completely balanced. Now, a new subset of (marketing) documents appears; that is like adding a third plank to the original seesaw. Clearly what happens is that now things are unbalanced again. The two existing planks sink down to the ground, and that third plank shoots up into the air. So how do we solve for that imbalance, without wasting the effort that has gone into understanding the first two planks? Again, we use our contextual diversity algorithm to find the most effective balance point, in the most efficient, direct (aka non-random) manner possible.

Contextual diversity does not care why or how the training over a collection of documents is imbalanced. It simply detects the points that, once pressure is applied to them, most effectively rebalance the system. It does not matter whether the seesaw started with two planks and then suddenly grew a third via rolling uploads, or started with three planks and someone’s judgmental sampling only hit two of them. In both cases, there is imbalance, and in both cases, the algorithm explicitly models and corrects for it.

You can read more about this topic here.

5. TAR Does Not Work for Non-English Documents

Many people have now realized that, properly done, TAR can work for other languages including the challenging CJK (Chinese, Japanese and Korean) languages. As we explained in the article, TAR is a “mathematical process that ranks documents based on word frequency. It has no idea what the words mean.”

Mr. Frazer seems to agree but is pitching a different kind of system for TAR:

Words are the weapons of lawyers so why in the world would you use a technology that does not know what they mean? TAR & PC are, IMHO, roads of diversion (perhaps destruction in an outlier case) for the true litigator. They are born out of the need to reduce data, rather than know what is in a large data set. They ignore a far better system is one that empowers the subject matter experts, the true litigators, and even the review team to use their intelligence, experience, and unique skills to find what they need, quickly and efficiently, regardless of how big the data is. A system is needed to understand the words in documents, and read them as a human being would.

There is a lot we could say in response to this, but this post is lengthy enough as it is. So let us briefly make just two points. The first is that we think Natural Language Processing techniques (which apparently your company uses) are great. There is a lot of value there. And frankly, we think that NLP techniques complement, rather than oppose, the more purely statistical techniques.

That said, our second point is simply to note that in some of the cases that we’ve dealt with here at Catalyst, we have datasets in which over 85% of the documents are computer source code. Where there is no natural language, there can be no NLP. And yet TAR still has to be able to handle those documents as well. So perhaps we should extend Myth #5 to say that it’s a myth that “TAR Does Not Work for Non-Human Language Documents.”

Conclusion

In writing the Five Myths of TAR, our point wasn’t to claim that Catalyst has the only way to address the practical limitations of early TAR systems. To the contrary, there are many approaches to technology-assisted review which a prospective user will want to consider, and some are more cost and time effective than others. Rather, our goal was to dispel certain myths that limit the utility of TAR and let people know that there are practical answers to early TAR limitations. Debating which of those answers works best should be the subject of many of these discussions. We enjoy the debate and try to learn from others as we go along.

In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey)

In a recent blog post, Ralph Losey tackles the issue of expertise and TAR algorithm training.  The post, as is characteristic of Losey’s writing, is densely packed.  He raises a number of different objections to doing any sort of training using a reviewer who is not a subject matter expert (SME).  I will not attempt to unpack every single one of those objections.  Rather, I wish to cut directly to the fundamental point that underlies the belief in the absolute necessity that an SME, and only an SME, should provide the judgments, the document codings, that get used for training:

Losey writes:

Quality of SMEs is important because the quality of input in active machine learning is important. A fundamental law of predictive coding as we now know it is GIGO, garbage in, garbage out. Your active machine learning depends on correct instruction.

Active machine learning depends on a number of different factors, including but not limited to the type of features (aka “signals”) that are extracted from the data and the complexity of the data itself (how “separable” the data is), even if perfect and complete labeling of every document in the collection were available.  All of these factors have an effect on the quality of the output.  But yes, one of those factors is the labels on the documents, the human-given coding.

Coding quality is indeed important.  However, what I question is this seemingly “common sense” objection of garbage in, garbage out (GIGO).  In offline discussion with Ralph, I was able to distill this objection into an even purer form, a part of that conversation which I reprint here with permission:

Sorry, but wrong decisions to find the right docs sounds like alchemy to me, lead to gold.

This is the essence of the entire conundrum: Can wrong decisions be used to find the right documents?  If it can be shown that wrong decisions can indeed be used to find the right documents, then while that does not automatically answer every single one of Ralph’s objections, it provides a solid foundation on which to do so as the industry continues to iterate our understanding of TAR’s capabilities. Thus, the purpose of this post is to focus on the fact that this can be done.  Future posts will then apply the principle to real world TAR workflows.


Pseudo Relevance Feedback

The first way we show that wrong decisions can be used to find right documents is to turn to an old information retrieval concept known as pseudo relevance feedback (PRF, aka blind feedback).  Imagine running a search on a collection of documents and getting back a list of results, some of which are relevant and some of which are not.  Ideally, you would want all the relevant ones toward the top and the non-relevant ones at the bottom.  We all know that doesn’t happen.  So the technique of pseudo relevance feedback is employed to improve the quality of the ranking.  PRF operates in the following manner:

  1. The top k (usually a couple dozen) results are selected from the top of the existing ranking.
  2. All top k documents are blindly judged to be relevant. That is, they’re automatically coded as relevant, whether or not they truly are.
  3. Those top k documents, with their relevant=true coding, are then fed back to the machine learner, and the ranking is altered based on this blind, or pseudo-relevant, feedback.
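For readers who like to see the mechanics, here is a minimal sketch of PRF in Python. It uses a classic Rocchio-style query expansion rather than any particular production learner, and the function name prf_rerank and its parameters are illustrative only; the point is simply that the top k results are blindly coded relevant and fed back in.

```python
# Minimal sketch of pseudo relevance feedback (blind feedback) using a
# Rocchio-style query update; a generic illustration, not any specific product.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def prf_rerank(query, docs, k=25, alpha=1.0, beta=0.75):
    vec = TfidfVectorizer()
    doc_vectors = vec.fit_transform(docs)
    q = vec.transform([query])

    # Step 1: run the initial search and take the top k results.
    scores = cosine_similarity(q, doc_vectors).ravel()
    top_k = np.argsort(scores)[::-1][:k]

    # Steps 2 and 3: blindly treat all top k documents as relevant, fold
    # their centroid back into the query, and re-rank the whole collection.
    centroid = np.asarray(doc_vectors[top_k].mean(axis=0))
    expanded = alpha * q.toarray() + beta * centroid
    new_scores = cosine_similarity(expanded, doc_vectors).ravel()
    return np.argsort(new_scores)[::-1]
```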

In the PRF regimen, many documents in this top k set are truly not relevant, and yet they are coded as relevant and used for training, as if by a human coder so naïve that he or she codes documents blindly.  Under the GIGO principle, these “garbage” wrong judgments should cause the quality of the ranking (the number of truly responsive documents at the top of the list) to go down.

And yet that turns out to not be the case.

As far back as 20 years ago (1994), information retrieval researchers were reporting that using nonrelevant documents to find relevant ones yielded better results.  For example, see Automatic Query Expansion Using SMART: TREC 3, by Buckley, Salton, Allan, and Singhal:

Massive query expansion also works in general for the ad-hoc experiments, where expansion and weighting are based on the top initially retrieved documents instead of known relevant documents. In the ad-hoc environment this approach may hurt performance for some queries (e.g. those without many relevant documents in the top retrieved set), but overall proves to be worthwhile with an average 20% improvement.

The astute reader will note, and might complain, that while performance does improve on average, and for the vast majority of queries, it is not universal.  “Why risk making some things worse,” one might ask, “even if most things get better?”  There are two answers to that.

The first answer is that, because PRF is a decades-old, established technique in the information retrieval world, there is a large, active body of research around it.  There are indeed researchers who have explored the trade-off between risk and reward (Estimation and Use of Uncertainty in Pseudo-relevance Feedback) and have learned to optimize around it (Accounting for Stability of Retrieval Algorithms using Risk-Reward Curve). These are but a few of many available papers that address the topic.

The second answer is simply to note that the goal here is not (yet) to address detailed issues of workflow, risk-mitigation, or total annotation cost.  Those deserve separate, full length treatises.  Rather, the goal here is simply to dispel the notion that “wrong” decisions cannot lead to “right” documents. By and large the body of literature on pseudo relevance feedback shows that they can. Full stop.

Experiments Using Only Wrong Documents

However, readers might raise the objection that PRF doesn’t use “wrong” judgments so much as it uses diluted “right” judgments.  After all, there are some truly relevant documents in the top k set that gets used for training, and so even while some truly non-relevant documents get blindly marked as relevant, some truly relevant documents also get marked as relevant, in the same way that even a broken watch correctly tells the time twice a day.  Even so, mixing some truly relevant documents into the blind feedback doesn’t change the fact that wrong decisions are still leading to right documents.  I also argue that this is a realistic parallel to what a non-SME would do, which is to make a lot of right decisions with some wrong decisions mixed in.  However, because of this potential objection, I will take things one step further.

This leads us to the second manner in which we show that wrong decisions can be used to find right documents.  And this one, I giddily foreshadow, is going to be a little more extreme in its demonstration.  These are some experiments that we’ve done using the proprietary Catalyst algorithms, so I will not talk about the algorithms, only the outputs.  The setup is as follows: As part of some of the earlier TREC Legal Track runs, ground truth (human judgments on documents) was established by TREC analysts.  However, teams that participated in the runs were allowed to submit documents that they felt were incorrectly judged, and the topic authority for the matter then made the final adjudication.  In some cases, the original judgment was upheld.  In some cases, it was overturned, and the topic authority made the final, correct call for a document that was different than the original non-authoritative reviewer had given.

For our experiment, we collected the docids of all those documents with topic authority overturns.  For the training of our system, we used only those docids, no more and no less.  However, we established two conditions.  In the first, the docids were given the coding value of responsive or non-responsive based on the final, topic authority judgment of each document.  In the second, the exact same docids were given the coding value assigned by the original, non-authoritative reviewer.  That is to say, in the second condition, the judgments weren’t just slightly wrong; they were 100% wrong.  In this second condition, all documents marked by the topic authority as responsive were given the training value of non-responsive, and vice versa.

The algorithms then used the training data from each condition separately to produce a ranking over the remainder of the collection.  The quality of each ranking was determined using test data consisting of the remaining judgments for which there was no disagreement, i.e. documents that either every team participating in the TREC evaluation felt were correctly judged in the first place, or for which the topic authority personally adjudicated and kept the original marking.  These results were visualized in the form of a yield curve.  The x-axis is the depth in the ranking, and the y-axis is the cumulative number of truly responsive documents available at that depth.  We do not show raw counts, but we do show a blue line representing the expected cumulative rate of discovery for manual linear review, i.e. what would happen on average if you were to review documents in random order.
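Computing a yield curve of that kind is straightforward once you have a ranking and a set of test judgments. The sketch below is a generic illustration, not the Catalyst ranking engine or its compensation factors: it shows how the cumulative count of responsive documents found at each review depth compares with the linear-review baseline, and how the “100% wrong” condition is constructed by flipping the training labels.

```python
# Generic yield-curve computation (illustration only, not the Catalyst
# ranking engine): cumulative responsive documents found vs. review depth,
# compared with the expected rate for manual linear (random-order) review.
import numpy as np

def yield_curve(scores, truth):
    """scores: model scores for the test documents; truth: 1 = responsive."""
    order = np.argsort(scores)[::-1]               # review in ranked order
    found = np.cumsum(np.asarray(truth)[order])
    depth = np.arange(1, len(truth) + 1)
    linear = depth * (np.sum(truth) / len(truth))  # blue-line baseline
    return depth, found, linear

# To reproduce the "100% wrong" condition, train on flipped labels
# (responsive becomes non-responsive and vice versa), then score the test set:
# flipped = [1 - y for y in training_labels]
```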

Experiment Yield Curve

The perpetually astute reader might be tempted at this point to shout out, “Aha! See, I knew it! The authoritative user’s judgments lead to better outcomes than the non-authoritative user’s. Garbage in, garbage out. Our position is vindicated!”  Let me, however, remind that reader of one fact: the “non-authoritative” training in this case consists of documents with 100% wrong judgments.  Not 10% wrong, 25% wrong, or even 50% wrong, as you might expect from a non-expert but trained contract reviewer.  But 100% wrong.  Keep that in mind as you compare these yield curves against manual linear review (the blue line).  What this experiment shows is that even when the training data is 100% wrong, the rate at which you are able to discover responsive documents — at least using the Catalyst algorithm with its proprietary compensation factors — significantly outperforms manual linear review.

Let me remind the reader of the goal of this exercise, which is to show that wrong decisions can be used to find right documents.  How we deal with various wrong decisions to mitigate risk, to maximize yield, etc. is a secondary question.  And it is one that is proper for the reader to ask. However, that question cannot be asked unless one first is willing to accept the notion that wrong decisions can lead to right documents.  That is the primary question, and the foundation on which we will be able to build further discussion of how exactly to deal with various kinds of wrongness, and to what extent it does or does not affect the overall outcome.

Lest the reader believe that this is an unrepeatable example, let us show another topic, with the experiment similarly designed:

Experiment Yield Curve 2

Now the yield curve for this experiment was lower than in the previous experiment, which has a lot to do with training set size, characteristics of the data, etc.  But the story that it tells is similar: Even training using documents that are 100% wrong in their labeling gives a yield that outperforms manual linear review.  All else aside, wrong decisions can and do lead to right documents.

Wrongness Indeed Leads to Rightness

I suppose one might also note that in this particular case, not only did wrong decisions lead to right documents, but those wrong decisions led to more right documents (higher yield) at various points than did the right decisions. Again, however, as I noted with the previous experiment, the goal here is not to compare, nor to delve into the workflow details about how to use wrong or right decisions. The goal is simply to show, as a first step, that wrongness can indeed lead to rightness.

We’ve repeated this experiment on a number of additional TREC matters, as well as on some of our own matters, and have consistently found the same outcome.  The common sense objection of “garbage in, garbage out” masks a host of underlying realities and algorithmic workarounds.  I believe that there is a common — I think even unconscious — assumption in the industry that anything that is not 100% correct is “garbage.” What I hope is that this post opens the door to the possibility that there is a wide spectrum in between garbage and perfection.

When it comes to producing documents, we as an industry often talk about the standard of reasonableness rather than perfection.  So why is it that when it comes to coding our training documents, we have a blind spot (yes, that’s a PRF pun) to the idea that reasonable coding calls can also lead to reasonable outcomes?  It is a false dichotomy to assume that the only two choices are garbage and complete expertise.  This post has shown that imperfect inputs, wrong decisions, are capable of leading to right documents.  That by itself does not wipe away every objection raised in Losey’s post – more discussion and experimental evidence is required – but it does undermine the foundation of those objections.

Predictive Ranking: Technology Assisted Review Designed for the Real World

Why Predictive Ranking?

Most articles about technology assisted review (TAR) start with dire warnings about the explosion in electronic data. In most legal matters, however, the reality is that the quantity of data is big, but it is no explosion. Even a half million documents—a relatively small number compared to the “big data” of the web—pose a serious challenge to a review team. That is a lot of documents, and they can cost a lot of money to review, especially if you have to go through them in a manual, linear fashion. Catalyst’s Predictive Ranking bypasses that linearity, helping you zero in on the documents that matter most. But that is only part of what it does.

In the real world of e-discovery search and review, the challenges lawyers face come not merely from the explosion of data, but also from the constraints imposed by rolling collection, immediate deadlines, and non-standardized (and at times confusing) validation procedures. Overcoming these challenges is as much about process and workflow as it is about the technology that can be specifically crafted to enable that workflow. For these real-world challenges, Catalyst’s Predictive Ranking provides solutions that no other TAR process can offer.

In this article, we will give an overview of Catalyst’s Predictive Ranking and discuss how it differs from other TAR systems in its ability to respond to the dynamics of real-world litigation. But first, we will start with an overview of the TAR process and discuss some concepts that are key to understanding how it works.

What is Predictive Ranking?

Predictive Ranking is Catalyst’s proprietary TAR process. We developed it more than four years ago and have continued to refine and improve it ever since. It is the process used in our newly released product, Insight Predict.

In general, all the various forms of TAR share common denominators: machine learning, sampling, subjective coding of documents, and refinement. But at the end of the day, the basic concept of TAR is simple, in that it must accomplish only two essential tasks:

  1. Finding all (or “proportionally all”) responsive documents.
  2. Verifying that all (or “proportionally all”) responsive documents have been found.

That is it. For short, let us call these two goals “finding” and “validating.”

Finding Responsive Documents

Finding consists of two parts:

  1. Locating and selecting documents to label. By “label,” we mean manually mark them as responsive or nonresponsive.
  2. Propagating (via an algorithmic inference engine) these labels onto unseen documents.

This process of finding or searching for responsive documents is typically evaluated using two measures of quality: precision and recall. Precision is the proportion of returned documents that are true hits (actually responsive documents) out of the total number of hits returned. Recall is the proportion of true hits returned by the search out of the total number of true hits in the population.
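In code, the two measures reduce to a pair of ratios (a textbook formulation, nothing vendor-specific):

```python
# Textbook precision and recall for a set of returned documents.
def precision_recall(returned_ids, responsive_ids):
    returned, responsive = set(returned_ids), set(responsive_ids)
    true_hits = len(returned & responsive)
    precision = true_hits / len(returned) if returned else 0.0
    recall = true_hits / len(responsive) if responsive else 0.0
    return precision, recall

# Example: if 80 of 100 returned documents are responsive, out of 200
# responsive documents in the population, precision is 0.80 and recall is 0.40.
```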

One area of contention and disagreement among vendors is step 1, the sampling procedures used to train the algorithm in step 2. Vendors’ philosophies generally fall into one of two camps, which can loosely be described as “judgmentalists” and “randomists.”

The judgmentalist approach assumes that litigation counsel (or the review manager) has the most insightful knowledge about the domain and matter and is therefore going to be the most effective at choosing training documents. The randomist approach, on the other hand, is concerned about bias. Expertise can help the system quickly find certain pockets of responsive information, the randomists concede, but the problem they see is that even experts do not know what they do not know. By focusing the attention of the system on some documents and not others, the judgmental approach potentially ignores large swaths of responsive information even while it does exceptionally well at finding others.

Therefore, the random approach samples every document in the collection with equal probability. This even-handed approach mitigates the problem of human bias and ensures that a wide set of starting points are selected. However, there is still no guarantee that a simple random sample will find those known pockets of responsive information about which the human assessor has more intimate knowledge.

At Catalyst, we recognize merits in both approaches. An ideal process would be one that combines the strengths of each to overcome the weakness of the other. One straightforward solution is to take the “more is more” approach and do both judgmental and random sampling. A combined sample not only has the advantage of human expertise, but also avoids some of the issues of bias.

However, while it is important to avoid bias, simple random sampling misses the point. Random sampling is good for estimating counts; it does not do as well at guaranteeing topical coverage (sussing out all pockets). The best way to avoid bias is not to pick “random” documents, but to select documents about which you know that you know very little. Let’s call it “diverse topical coverage.”

Remember the difference between the two goals: finding vs. validating. For validation, a statistically valid random sample is required. But for finding, we can be more intelligent than that. We can use intelligent algorithms to explicitly detect which documents we know the least about, no matter which other documents we already know something about. This is more than simple random sampling, which offers no guarantee of topical coverage of a collection. This is using algorithms to explicitly seek out those documents about which we know nothing or next to nothing. The Catalyst approach is therefore not to stand in the way of our clients by shoehorning them into a single sampling regimen for the purpose of finding. Rather, our clients may pick whatever documents they want to judge, for whatever reason, and “contextual diversity sampling” will detect any imbalances and help select the rest.

Examples of Finding

The following examples illustrate the performance of Catalyst’s algorithms with respect to the points made in the previous section about random, judgmental, and contextual diversity sampling. In each example, the horizontal x-axis represents the percentage of the collection that must be reviewed in order to reach the recall level shown on the y-axis using Catalyst’s Predictive Ranking algorithms.

For example, in this first graph we have a Predictive Ranking task with a significant number of responsive documents, a high richness. There are two lines, each representing a different initial seed condition: random versus judgmental. The first thing to note is that judgmental sampling starts slightly “ahead” of random sampling. The difference is not huge; the judgmental approach finds perhaps 2-3% more documents initially. That is to be expected, because the whole point of judgmental sampling is that the human can use his or her intelligence and insight into the case or domain to find documents that the computer is not capable of finding by strictly random sampling.

That brings us to the concern that judgmental sampling is biased and will not allow TAR algorithms to find all the documents. However, this chart shows that by using Catalyst’s intelligent iterative Predictive Ranking algorithms, both the judgmental and random initial sampling get to the same place. They both get about 80% of the available responsive documents after reviewing only 6% of the collection, 90% after reviewing about 12% of the collection, and so forth. Initial differences and biases are swallowed up by Catalyst’s intelligent Predictive Ranking algorithms.

In the second graph, we have a different matter in which the number of available responsive documents is over an order of magnitude less than in the previous example; the collection is very sparse. In this case, random sampling is not enough. A random sample does not find any responsive documents, so nothing can be learned by any algorithm. However, the judgmental sample does find a number of responsive documents, and even with this sparse matter, 85% of the available responsive documents may be found by only examining a little more than 6% of the collection.

However, a different story emerges when the user chooses to switch on contextual diversity sampling as part of the algorithmic learning process. In the previous example, contextual diversity was not needed. In this case, especially with the failure of the random sampling approach, it is. The following graph shows the results of both random sampling and judgmental sampling with contextual diversity activated, alongside the original results with no contextual diversity:

Adding contextual diversity to the judgmental seed has the effect of slowing learning in the initial phases. However, after only about 3.5% of the way through the collection, it catches up to the judgmental-only approach and even surpasses it. A 95% recall may be achieved a little less than 8% of the way through the collection. The results for adding contextual diversity to the random sampling are even more striking. It also catches up to judgmental sampling about 4% of the way through the collection and also surpasses it by the end, ending up at just over 90% recall a little less than 8% of the way through the collection.

These examples serve two primary purposes. First, they demonstrate that Catalyst’s iterative Predictive Ranking algorithms work, and work well. The vast majority of a collection does not need to be reviewed, because the Predictive Ranking algorithm finds 85%, 90%, 95% of all available responsive documents within only a few percent of the entire collection.

Second, these examples demonstrate that, no matter how you start, you will attain that good result. It is this second point that bears repeating and further consideration. Real-world e-discovery is messy. Collection is rolling. Deadlines are imminent. Experts are not always available when you need them to be available. It is not always feasible to start a TAR project in the clean, perfect, step-by-step manner that a vendor might require. Knowing that one can instead start either with judgmental samples or with random samples, and that the ability to add a contextual diversity option ensures that early shortcomings are not only mitigated but exceeded, is of critical importance to a TAR project.

Validating What You Have Found

Validating is an essential step in ensuring legal defensibility. There are multiple ways of doing it. Yes, there needs to be random sampling. Yes, it needs to be statistically significant. But there are different ways of structuring the random samples. The most common method is to do a simple random sample of the collection as a whole, and then another simple random sample of the documents that the machine has labeled as nonresponsive. If the richness of responsive documents in the latter sample has significantly decreased from the responsive-document richness in the initial whole population, then the process is considered to be valid.

However, at Catalyst we use a different procedure, one that we think is better at validating results. Like other methods, it also relies on random sampling. However, instead of doing a simple random sample of a set of documents, we use a systematic random sample of a ranking of documents. Instead of labeling documents first and sampling for richness second, the Catalyst procedure ranks all documents by their likelihood of being responsive. Only then is a random sample—a systematic random sample—taken.

At equal intervals across the entire list, samples are drawn. This gives Catalyst the ability to better estimate the concentration of responsive documents at every point in the list than an approach based on unordered simple random sampling. With this better estimate, a smarter decision boundary can be drawn between the responsive and nonresponsive documents. In addition, because the documents on either side of that boundary have already been systematically sampled, there is no need for a two-stage sampling procedure.
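Here is a minimal sketch of a systematic random sample over a ranking, as a generic illustration of the technique rather than the exact Catalyst procedure; the function name and parameters are our own.

```python
# Systematic random sample over a ranked list (general technique, not the
# exact Catalyst procedure): one random start, then equal-interval draws.
import random

def systematic_sample(ranked_doc_ids, sample_size):
    interval = len(ranked_doc_ids) / sample_size
    start = random.uniform(0, interval)
    picks = [int(start + i * interval) for i in range(sample_size)]
    return [ranked_doc_ids[p] for p in picks]

# Coding the sampled documents, in ranked order, gives an estimate of the
# concentration of responsive documents at every depth of the ranking.
```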

Workflow: Putting Finding and Validating Together

In the previous section, we introduced the two primary tasks involved in TAR: finding and validation. If machines (and humans, for that matter) were perfect, there would be no need for these two stages. There would only be a need for a single stage. For example, if a machine algorithm were known to perfectly find every responsive document in the collection, there would be no need to validate the algorithm’s output. And if a validation process could perfectly detect when all documents are correctly labeled, there would be no need to use an algorithm to find all the responsive ones; all possible configurations (combinatorial issues aside) could be tested until the correct one is found.

But no perfect solutions exist for either task, nor will they in the future. Thus, the reason for having a two-stage TAR process is so that each stage can provide checks and balances to the other. Validation ensures that finding is working, and finding ensures that validation will succeed.

Therefore, TAR requires some combination of both tasks. The manner in which finding and validation are symbiotically combined is known as the e-discovery “workflow.” The workflow is not standardized; it varies from vendor to vendor. For the most part, every vendor’s technology combines these tasks in a way that, ultimately, is defensible. However, defensibility is the minimum bar that must be cleared.

Some combinations might work more efficiently than others. Some combinations might work more effectively than others. And some workflows allow for more flexibility to meet the challenges of real world e-discovery, such as rolling collection.

We’ll discuss a standard model, typical of the industry, then review Catalyst’s approach, and finally conclude with the reason Catalyst’s approach is better. Hint: It’s not (only) about effectiveness, although we will show that it is that. Rather, it is about flexibility, which is crucial in the work environments in which lawyers and review teams use this technology.

Standard TAR Workflow

Most TAR technologies follow the same essential workflow. As we will explain, this standard workflow suffers from two weaknesses when applied in the context of real-world litigation. Here are the steps it entails:

  1. Estimate via simple random sampling how many responsive and nonresponsive docs there are in the collection (aka estimate whole population richness).
  2. Sample (and manually, subjectively code) documents.
  3. Feed those documents to a predictive coding engine to label the remainder of the collection.
  4. If manual intervention is needed to assist in the labeling (for example via threshold or rank-cutoff setting), do so at this point.
  5. Estimate via simple random sampling how many responsive documents there are in the set of documents that have been labeled in steps 3 and 4 as nonresponsive.
  6. Compare the estimate in step 5 with the estimate in step 1. If there has been a significant decrease in responsive richness, then the process as a whole is valid.
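As a simplified illustration of the validation arithmetic in steps 1, 5 and 6 (real protocols also report sample sizes and confidence intervals, which we omit here), the comparison looks roughly like this; the required_drop threshold is an invented placeholder, not an industry standard.

```python
# Simplified illustration of steps 1, 5 and 6: compare initial richness
# with the richness of the predicted-nonresponsive set.
# Real protocols would also report margins of error / confidence intervals.
def richness(sample_labels):
    """sample_labels: 1 = responsive, 0 = nonresponsive, for a random sample."""
    return sum(sample_labels) / len(sample_labels)

def validates(initial_sample, nonresponsive_sample, required_drop=0.75):
    initial = richness(initial_sample)            # step 1 estimate
    elusion = richness(nonresponsive_sample)      # step 5 estimate
    # Step 6: has responsive richness dropped significantly?
    return elusion <= (1 - required_drop) * initial
```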

TAR as a whole relies on these six steps working as a harmonious process. However, each step is not done for the same reason. Steps 2-4 are for the purpose of finding and labeling. Steps 1, 5, and 6 are for the purpose of validation.

The first potential weakness in this standard workflow stems from the fact that the validation step is split into two parts, one at the very beginning and one at the very end. It is the relative comparison between the beginning and the end that gives this simple random-sampling-based workflow its validity. However, that also means that in order to establish validity, no new documents may arrive at any point after the workflow has started. Collection must be finished.

In real-world settings, collection is rarely complete at the outset. If new documents arrive after the whole-population richness estimate (step 1) is already done, then that estimate will no longer be statistically valid. And if that initial estimate is no longer valid, then the final estimates (step 5), which compare themselves to that initial estimate, will also not be valid. Thus, the process falls apart.

The second potential weakness in the standard workflow is that the manual intervention for threshold setting (step 4) occurs before the second (and final) random sampling (step 5). This is crucial to the manner in which the standard workflow operates. In order to compare before and after richness estimates (step 1 vs. step 5), concrete decisions will have had to be made about labels and decision boundaries. But in real-world settings, it may be premature to make concrete decisions at this point in the overall review.

How Catalyst’s Workflow Differs

In order to circumvent these weaknesses and match our process more closely to real-world litigation, Catalyst’s Predictive Ranking uses a proprietary, four-step workflow:

  1. Sample (and manually, subjectively code) documents.
  2. Feed those documents to our Predictive Ranking engine to rank the remainder of the collection.
  3. Estimate via a systematic random sample the relative concentration of responsive documents throughout the ranking created in step 2.
  4. Based on the concentration estimate from step 3, select a threshold or rank-cutoff setting which gives the desired recall and/or precision.
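And, as an illustration of step 4 only (not the Insight Predict implementation), a rank cutoff can be chosen from the coded systematic sample by finding the shallowest depth whose estimated recall meets the target; the function name and parameters below are hypothetical.

```python
# Illustration of step 4: pick the shallowest rank cutoff whose estimated
# recall (from the coded systematic sample) meets the target.
def choose_cutoff(sample_depths, sample_labels, target_recall=0.80):
    """sample_depths: rank position of each sampled doc, in ascending order;
    sample_labels: 1 = responsive, 0 = nonresponsive, for those same docs."""
    total_responsive = sum(sample_labels)
    if total_responsive == 0:
        return None  # the sample found nothing responsive; cannot estimate
    found = 0
    for depth, label in zip(sample_depths, sample_labels):
        found += label
        if found / total_responsive >= target_recall:
            return depth          # estimated cutoff in the full ranking
    return sample_depths[-1]      # review everything if the target is never met
```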

Once again, as with the standard predictive coding workflow, our Predictive Ranking as a whole relies on these four steps working as a harmonious process. However, each step is not done for the same reason. Steps 1 and 2 are for the purpose of finding and labeling. Steps 3 and 4 are for the purpose of validation.

Two important points should be noted about Catalyst’s workflow. The first is that the validation step is not split into two parts. Validation only happens at the very end of the entire workflow. If more documents arrive while documents are being found and labeled during steps 1 and 2 (i.e. if collection is rolling), the addition of new documents does not interfere with anything critical to the validation of the process. (Additional documents might make finding more difficult; finding is a separate issue from validating, one which Catalyst’s contextual diversity sampling algorithms are designed to address.)

The fact that validation in our workflow is not hampered by collections that are fluid and dynamic is significant. In real-world e-discovery situations, rolling collection is the norm. Our ability to handle this fluidity natively—by which we mean central to the way the workflow normally works, rather than as a tacked-on exception—is highly valuable to lawyers and review teams.

The second important point to note about Catalyst’s workflow is that the manual intervention for threshold setting (step 4) happens after the systematic random sample. At first it may seem counterintuitive that this is defensible, because choices about the labeling of documents are being made after a random sample has been taken. But the purpose of the systematic random sample is to estimate concentrations in a statistically valid manner. Since the concentration estimates themselves are valid, decisions made based on those concentrations are also valid.

Consequences and Benefits of the Catalyst Workflow

We have already touched on two key ways in which the Catalyst Predictive Ranking workflow differs from the industry-standard workflow. It is important to understand what our workflow allows us—and you—to do:

  1. Get good results. Catalyst Predictive Ranking consistently demonstrates high scores for both precision and recall.
  2. Add more training samples, of any kind, at any time. That allows the flexibility of having judgmental samples without bias.
  3. Add more documents, of any kind, at any time. You don’t have to wait 158 days until all documents are collected. And you don’t have to repeat step 1 of the standard workflow when those additional documents arrive.
  4. Go through multiple stages of culling and filtering without hampering validation. In the standard workflow, that would destroy your baseline. This is not a concern with the Catalyst approach, which saves the validation to the very end, via the systematic sample.

Catalyst has more than four years of experience using Predictive Ranking techniques to target review and reduce document populations. Our algorithms are highly refined and highly effective. Even more important, however, is that our Predictive Ranking workflow has what other vendors’ workflows do not—the flexibility to accommodate real-world e-discovery. Out there in the trenches of litigation, e-discovery is a dynamic process. Whereas other vendors’ TAR workflows require a static collection, ours flows with the dynamics of your case.

In Search, Evaluation Drives Innovation; Or, What You Cannot Measure You Cannot Improve

[Photo: Information retrieval researchers at Shonan last week. That's me in the center, wearing the yellow T-shirt.]

Last week, I was honored to join a small group of information-retrieval researchers from around the world, from both industry and academia, who gathered at the Shonan Village Center in Kanagawa, Japan, to discuss issues surrounding the evaluation of whole session, interactive information retrieval. In this post, I introduce the purpose of this meeting. In later posts, I hope to further review the discussions that took place at Shonan and my own impressions.

Traditionally, information retrieval (a.k.a. search) has been viewed as a stateless, non-interactive process. The user issues a(n ad hoc) query to a search engine and the engine responds with its best attempt at answering that query, with results ranked by their likelihood of satisfying a user’s information need.

Interactive information retrieval, on the other hand, presumes multiple rounds of user-system exchange. The interactions during this exchange are presumed to be non-independent. Each query has some sort of relationship to previous queries, if only because the overall series is in support of the same user task or goal.

Examples of scenarios in which interactive information retrieval is necessary include travel or event planning, education and learning, seeking entertainment, and (of course) e-discovery. When queries are independent, the best the system can do is answer each query as if it were the last that the user will ask. However, when queries are non-independent, both the user and the system have the chance to engage in deeper and wider patterns of exploration.

Evaluating Interactive Information Retrieval

Evaluation of one-shot queries has a long and rich history. Concepts such as “binary relevance” and “precision and recall,” combined with batch mode evaluation, have led to countless advances in the state of the art. These advances, from the 1960s to the 1990s, allowed search engines, especially in a web context, to improve to the point at which they now bring huge benefits to society. Evaluation of interactive information retrieval tasks, on the other hand, does not have as-yet universally accepted metrics. The very nature of the interactivity (non-independence of a sequence of user actions and system responses) both gives the scenario its power and makes it difficult to evaluate.

The power, again, comes from the breadth and depth of what is made possible; the evaluation difficulty comes from this very same interdependence. When a single query is performed, it can be generally expected that the user traverses the results list in linear order, from estimated best to estimated worst result. And, with some probability, the user abandons the list traversal. These (generally realistic) assumptions allow the ad hoc, one-shot query to be evaluated in terms of the position of relevant documents within the list.
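For readers who want a concrete instance of such a position-based measure, rank-biased precision (Moffat and Zobel) models exactly this behavior: the user examines each next result with a fixed persistence probability, so relevant documents deeper in the list contribute geometrically less. A minimal sketch:

def rank_biased_precision(relevance, persistence=0.8):
    """`relevance` is a list of 0/1 judgments in ranked order; the user moves
    from one result to the next with probability `persistence`."""
    return (1 - persistence) * sum(rel * persistence ** i for i, rel in enumerate(relevance))

# e.g. rank_biased_precision([1, 0, 1, 1]) is roughly 0.43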

However, when multiple queries are performed, an element of non-determinism enters into the picture. A user typically does not examine all results in the list from the first query, then all results in the list from the second query, and so on. Instead, one user might only examine 57 results from the first query, 9 results from the second query, and then 82 results from the third query. Another user might examine 3 results from the first query, 18 results from the second query, and 17 results from the third query.

Furthermore, the order in which the results are seen by the user affects the next round of interactivity. That is, the second and third queries that are issued by an information seeker are influenced by which documents were seen during the first round of interaction. Even if two users started with the same first query, the user who looked at 57 results might have a very different notion of how to formulate the next query than the user who looked at only 3 results.

How, then, should these two users’ experiences with the interactive search engine be evaluated? Should it be the product or sum of the quality of the individual ranked lists for each query? That ignores the depth to which the user actually traveled in each list over the course of the session. Should evaluation instead be a function of the sequence of documents that the user actually saw during the course of the session, no matter which individual results list a document came from? That is better, but it still ignores the effects of document examination order on the queries that were issued — and more importantly on the queries that could have been issued, had the user traversed to either a shallower or deeper position within a particular list. The non-deterministic range of possibilities poses a severe challenge to the evaluation of interactive information retrieval.
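As a concrete, and admittedly partial, instance of the second option above, one could compute a gain-based measure over the sequence of documents the user actually examined, in the order they were seen, regardless of which query’s result list each came from. A rough sketch follows; as noted, it still ignores how examination order shapes the queries that follow.

import math

def session_dcg(seen_relevance):
    """Discounted cumulative gain over the documents the user actually saw
    during the session, in the order they were seen."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(seen_relevance))

# Concatenate the judged sequence each user saw across all of their queries,
# then compare session_dcg(user_a_sequence) with session_dcg(user_b_sequence).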

Another issue related to whole-session evaluation in interactive information seeking has to do with progress during versus upon completion of an entire session. Should the primary focus of evaluation be to estimate the quality of a session only at the end of the user’s sequence of interactions? Or is it more important to have a metric which measures, i.e. expects, progress throughout a session? Inherent in the answer to this question is whether one expects interactive information retrieval progress to be linear. Is it? Should it be? The answer is an open question, one which we discussed at the Shonan Meeting.

Evaluation drives innovation. If you cannot measure something, you cannot improve it. The first step to improving interactive information retrieval systems is knowing what to measure and how to measure it. Only then will consistent improvements be possible.

Search Q&A: How ECA is ‘Broken’ and the Solution that will Fix It

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?

DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.

Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by completely removing 50% of the documents or more–prior to going into active document searching and review. This practice is often carried out by using metadata (such as date or author), keywords or concepts, and removing documents that contain certain “obviously” non-relevant terms.

In theory, the idea is fantastic. It greatly reduces the cost of both hosting and reviewing. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar; they have primary and secondary indexes. Content most likely to be relevant and useful to their users gets fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary indexes are made smaller and faster, making the overall search process much better.

However, there is a key difference between ECA and the web engine practice of primary and secondary indexing.  In ECA, there is no secondary index. Documents that have been judged non-relevant on the sole basis of a few keywords or concepts or metadata are simply removed from the process completely, never to be revisited. Therein lies the problem.

I am an information-retrieval research scientist. One of the core precepts in my field is that a document will be relevant for only a few, very specific reasons, but non-relevant for dozens if not hundreds of reasons. The corollary to this is that there are many more keywords and concepts found in non-relevant documents that are also found in relevant documents than vice versa. That is, there is a higher probability that a keyword or concept found in a non-relevant document will also be found in a relevant document.

So what does that mean for ECA?  The problem arises if you are using keywords and concepts to filter out non-relevant documents without actually assessing them for relevance (i.e. without actually doing review). In that case, there is
a strong danger that the keywords and concepts you are using to do the filtering are also removing a number of relevant documents. And because you’re not doing what the web search engines do–creating a secondary index that can be revisited at a later point in time–but instead are completely removing those ECA’d documents from all further search and review, you’re losing those relevant documents forever.

When a Slam Dunk is a Smoking Gun

For example, one might be tempted to use ECA tools to filter out all documents that contain the terms “football,” “touchdown,” “49ers,” “Lakers,” “slam dunk,” “foul shot,” etc. Clearly these are all sports references and (let’s presume) sports emails are not relevant to the matter at hand but rather part of background office chatter. However, suppose the collection contains an email that says, “Cindy, I was able to reverse engineer competitor X’s code. I think this should make our new product offering a total slam dunk!” Or there might be another email that says, “Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers game after work on Monday? We can discuss our plans to fix the price of pork bellies.”

If the terms “49ers” and “slam dunk” have already been used during the ECA phase to completely remove every document that contains them, then these critical documents will be completely missed, putting the litigant at severe risk.

The solution, therefore, is to employ ECA in a manner that does not completely obliterate documents. Instead, ECA should be a tool for shifting certain sets of documents to a lower retrieval priority, a lower review priority or a secondary index. All of the documents should still be available. ECA simply helps with an intelligent prioritization of the searching and reviewing of those documents.

This approach allows the primary review to continue on as usual, with all the advantages of a pre-culled smaller number of documents. But if certain terms get discovered as part of that primary review process–terms such as “reverse engineer” or “pork bellies”–those terms can be used as queries into the secondary index. Then, the documents talking about meeting at the brew pub to watch the 49ers game and discuss the price fixing of pork bellies can still be recovered, despite having been pre-culled at an early stage. At the same time, if those ECA’d documents don’t contain “pork bellies,” they still remain in the secondary index and do not disrupt the efficiency and effectiveness of the primary index. It is the best of both worlds.
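A toy sketch of this tiered approach appears below. It is meant only to illustrate the “deprioritize, don’t delete” idea, not any particular product’s indexing architecture.

class TieredIndex:
    """Culled documents are shifted to a secondary tier instead of being
    deleted, so terms discovered later (such as "pork bellies") can still
    pull them back into review."""

    def __init__(self, documents, cull_terms):
        self.primary, self.secondary = [], []
        for doc in documents:
            text = doc.lower()
            if any(term in text for term in cull_terms):
                self.secondary.append(doc)   # deprioritized, not destroyed
            else:
                self.primary.append(doc)

    def search(self, query, include_secondary=False):
        tiers = self.primary + (self.secondary if include_secondary else [])
        return [doc for doc in tiers if query.lower() in doc.lower()]

# index = TieredIndex(collection, cull_terms={"49ers", "slam dunk", "touchdown"})
# index.search("pork bellies", include_secondary=True)  # recovers the culled email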

In short, the problem with ECA today is that it draws hard boundaries–it makes permanent decisions about documents when it really shouldn’t. The solution is to make those boundaries softer, to treat ECA as a prioritization tool, or as a mechanism for shifting documents into tiered secondary and even tertiary indexes. In that manner, poor decisions made early on in the process, under the blindness of an ECA process, are not made permanent. They can be easily, automatically and effectively corrected.

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).

By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.

But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
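As a rough illustration (hypothetical function names, not any production feature set), a signal is simply a property that can be computed per document and then checked against the labels gathered so far to decide whether it is worth amplifying:

def capitalized_17th_word(doc_text):
    """An intentionally odd signal: is the 17th word of the document capitalized?"""
    words = doc_text.split()
    return len(words) >= 17 and words[16][0].isupper()

def signal_weight(labeled_docs, signal_fn):
    """Compare how often the signal fires in responsive versus non-responsive
    documents labeled so far; a lopsided ratio suggests the signal is worth
    amplifying in the ranking model. `labeled_docs` is (text, is_responsive) pairs."""
    pos = sum(1 for text, responsive in labeled_docs if responsive and signal_fn(text))
    neg = sum(1 for text, responsive in labeled_docs if not responsive and signal_fn(text))
    return (pos + 1) / (neg + 1)   # add-one smoothing avoids division by zero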

Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.

Search Q&A: The Six Blind Men and the E-Discovery Elephant

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of search algorithms out there. Why do you feel that collaboration is a better way to search?

DR. JEREMY PICKENS: Collaboration is a better way to search because e-discovery is not all about the algorithms; it also involves people.

In a previous post (Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?), I talked about why there will never be a magic bullet for e-discovery. That primarily has to do with the fact that an information need is typically never satisfied with just a single document, as it often is in web search. Rather, in e-discovery, hundreds and thousands of responsive documents must be found.

When there is that much information, it can be quite beneficial to have more than one person’s viewpoint. Every query is a different hypothesis about what is relevant, a different probe into the collection. More people working together means more viewpoints, which translate into a wider variety of probes.

An algorithm that is multi-searcher aware and tries to reconcile (look for both similarities among and gaps between) the various searcher activities is going to do a better job than an algorithm that only comes at the problem from one viewpoint.
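A minimal sketch of what such reconciliation could mean, purely for illustration: gather the document sets each searcher’s probes have retrieved, then separate the documents everyone converged on from the documents only a single searcher reached.

def reconcile_probes(probe_results):
    """`probe_results` maps each searcher to the set of document IDs their
    queries have retrieved. Returns the documents all searchers converged on
    and the documents only one searcher reached (the gaps worth exploring)."""
    sets = list(probe_results.values())
    agreed = set.intersection(*sets)
    seen_once = {doc for s in sets for doc in s
                 if sum(doc in other for other in sets) == 1}
    return agreed, seen_once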

Think of it with reference to that old story of the six blind men who wanted to know what an elephant looked like. The first man touched the elephant’s leg and declared, “The elephant is a pillar.” The second touched its tail and described it as like a rope. The third felt the trunk and said it was like a tree branch. The fourth felt the ear and thought it was like a big fan. The fifth touched the belly and asserted it was a thick wall. The sixth felt the tusk and contended the elephant was like a solid pipe.

Seeing that the blind men could not agree on what the elephant looked like, a passing wise man explained, “All of you are right. The reason every one of you is telling it differently is because each one of you touched a different part of the elephant. Actually, the elephant has all the features each of you found.”

In a sense, e-discovery search is like those blind men’s search of an elephant. Provided the searchers work collaboratively, then as each searcher touches and interprets a part, eventually the whole elephant emerges. In search, therein lies the benefit of collaboration.

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.
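As a toy illustration of “jointly and simultaneously” (not a description of any actual system), scoring candidates for a team rather than an individual might look something like this, where each member contributes a scoring function of their own:

def joint_score(candidate, member_score_fns, min_each=0.5):
    """Each team member supplies a scoring function returning a value in [0, 1].
    A candidate that falls below any member's minimum is dropped; the rest are
    ranked by how well they satisfy everyone together."""
    scores = [score(candidate) for score in member_score_fns]
    if min(scores) < min_each:
        return 0.0
    return sum(scores) / len(scores)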

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and LAW (ICAIL).  This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm).  The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.

Hot on the list of topics, of course, was predictive coding.  Much of the discussion centered around determining exactly what standards were needed not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology may be safely applied to a matter at hand while still meeting all the legal requirements of discovery.

So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself.  Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process?  Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”

So I read the patent, and I read Recommind’s explanation, and I read the commentary, including Barry Murphy’s post, Dawn of the Predictive Coding Wars. First, from Murphy’s commentary:

According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on coattails.

Second, from the explanation:

Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …

The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan.  But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part.  Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.

OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology, so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow and documentation really ever explained. Great pains were taken to say what predictive coding was not (e.g. threading, clustering, etc. – which I agree with). But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.

For that, I had to turn to the patent itself. See Figure 5 in the patent (above), labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim).  That claim says that the patent covers a method for analyzing a plurality of documents, comprising:

(1) Receiving the plurality of documents via a computing device

(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]

(3) Executing instructions stored in memory, that:

(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset

(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category

(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category

(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation

(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents

(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis

(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and

(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents

So that appears to be the primary workflow, the primary patented claim.  Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.

In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does. It starts with a collection of documents. It selects a subset of those documents in some manner. It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself so as to better align itself with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevant or nonrelevant, responsive or nonresponsive) for all the results. Finally, it should be noted that the traditional, decades-old relevance feedback process workflow also is capable of iteration.
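For concreteness, the classic Rocchio formulation is one way that recomputation step has traditionally been implemented. This is a textbook sketch offered only to illustrate relevance feedback; it is not the workflow claimed in the patent.

import numpy as np

def rocchio_update(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of documents marked relevant
    and away from the centroid of documents marked nonrelevant, then use the
    revised query to re-rank the collection."""
    q = alpha * query_vec
    if len(rel_vecs):
        q = q + beta * np.mean(rel_vecs, axis=0)
    if len(nonrel_vecs):
        q = q - gamma * np.mean(nonrel_vecs, axis=0)
    return np.maximum(q, 0.0)   # negative term weights are typically clipped to zero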

So what is the difference? I don’t just ask this rhetorically. I see a very strong similarity in the overall workflows between both predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be–and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies, a goal in which I fully agree with Recommind–then we really need to understand the process as a whole and what makes it unique.

The only thing I can think of is that there are a few occasions in the claimed predictive coding workflow that integrate random sampling, and this is most likely to ensure that the process is defensible. If that is the case, then how does that differ from active learning? Here is an example of the active learning workflow which incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:

Input: seed labelled data L, unlabelled data U, batch size b

Initialization: Train a model on L

Active Learning Loop:
Until a stopping criterion is satisfied:
  Apply the trained model classifier on U
  Rank the instances in U using the uncertainty of the model
  Annotate the top b instances and add them to L
  Train the model on the expanded L

That is, instead of just presenting the expert user (e.g. lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue they’ve been coded for, an active learning process or workflow explicitly seeks to add those document instances about which the learning algorithm is the most uncertain. That could mean documents for which the probability of that document’s label is relatively even or undistinguished (highest entropy) across all classes (in the case of generative machine learning models) or documents which lie the nearest to a decision boundary (in the case of discriminative machine learning models).

However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way–nay even the only way–of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of.  Thus, active learning as a general workflow pattern also incorporates random sampling.
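Putting those two ideas together, a single round of batch selection in an active-learning loop might look like the sketch below, where model.predict_proba is an assumed interface returning the probability of responsiveness and a small random fraction is mixed in to surface regions the model has no signal for yet.

import random

def select_batch(model, unlabeled, batch_size=10, random_fraction=0.2):
    """Rank unlabeled documents by uncertainty (probability nearest 0.5) and
    blend in a few random picks so unknown regions of the collection are not
    ignored. Returns the documents to send to the expert for labeling."""
    by_uncertainty = sorted(unlabeled, key=lambda d: abs(model.predict_proba(d) - 0.5))
    n_random = int(batch_size * random_fraction)
    chosen = by_uncertainty[: batch_size - n_random]
    remaining = [d for d in unlabeled if d not in chosen]
    chosen += random.sample(remaining, min(n_random, len(remaining)))
    return chosen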

So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are.  That isn’t to say that they don’t exist.  However, I believe further discussion is warranted, both in public as well as at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole.  That is, I agree with Barry Murphy over at eDiscovery Journal that:

No matter what, this is good news for the eDiscovery market as a whole.  One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.

I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.
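A minimal supervised-learning sketch, assuming the scikit-learn library and a simple TF-IDF representation (an illustration only, not Catalyst’s production pipeline):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def prioritize(labeled_texts, labels, unlabeled_texts):
    """Fit a classifier on reviewer-labeled documents, then return the
    unreviewed documents ordered by predicted probability of responsiveness."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(labeled_texts)
    model = LogisticRegression(max_iter=1000).fit(X_train, labels)
    scores = model.predict_proba(vectorizer.transform(unlabeled_texts))[:, 1]
    return sorted(zip(unlabeled_texts, scores), key=lambda pair: pair[1], reverse=True)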

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.
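And a corresponding unsupervised sketch, again assuming scikit-learn, using the non-negative matrix factorization mentioned above to surface the most frequently occurring topics without any labels at all:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def top_topics(texts, n_topics=5, n_words=8):
    """Factor the term-document matrix and return, for each topic, the words
    with the largest weights."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0).fit(X)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in component.argsort()[-n_words:][::-1]]
            for component in nmf.components_]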

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.