Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices
Jeremy Pickens

About Jeremy Pickens

Jeremy Pickens, Ph.D., is one of the world's leading search scientists and a pioneer in the field of collaborative exploratory search, a form of search in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has six patents pending in the field of search and information retrieval, including two for collaborative exploratory search systems.

At Catalyst, Dr. Pickens researches and develops methods of using collaborative search to achieve more intelligent and precise results in e-discovery search and review. He also studies other ways to enhance search and review within the Catalyst system.

Dr. Pickens earned his master's and doctoral degrees at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King's College, London, on a joint grant with Goldsmiths University of London. As part of the OMRAS project (Online Music Recognition and Searching), he helped organize the first Music Information Retrieval (ISMIR) conference in Plymouth, Mass. Before joining Catalyst, Dr. Pickens spent five years as a research scientist at FX Palo Alto Lab, where his major research themes included video search and collaborative exploratory search.

Dr. Pickens is co-author of the forthcoming book, A Taxonomy of Collaborative Information Seeking, to be published by Morgan & Claypool Publishers. He was an editor of the spring 2010 special issue on collaborative information seeking of the journal Information Processing and Management. He is a frequent author and speaker on the topic.

Predictive Ranking: Technology Assisted Review Designed for the Real World

Why Predictive Ranking?

Most articles about technology assisted review (TAR) start with dire warnings about the explosion in electronic data. In most legal matters, however, the reality is that the quantity of data is big, but it is no explosion. Even a half million documents (a relatively small number in comparison to the “big data” of the web) pose a significant and serious challenge to a review team. That is a lot of documents and can cost a lot of money to review, especially if you have to go through them in a manual, linear fashion. Catalyst’s Predictive Ranking bypasses that linearity, helping you zero in on the documents that matter most. But that is only part of what it does.

In the real world of e-discovery search and review, the challenges lawyers face come not merely from the explosion of data, but also from the constraints imposed by rolling collection, immediate deadlines, and non-standardized (and at times confusing) validation procedures. Overcoming these challenges is as much about process and workflow as it is about the technology that can be specifically crafted to enable that workflow. For these real-world challenges, Catalyst’s Predictive Ranking provides solutions that no other TAR process can offer.

In this article, we will give an overview of Catalyst’s Predictive Ranking and discuss how it differs from other TAR systems in its ability to respond to the dynamics of real-world litigation. But first, we will start with an overview of the TAR process and discuss some concepts that are key to understanding how it works.

What is Predictive Ranking?

Predictive Ranking is Catalyst’s proprietary TAR process. We developed it more than four years ago and have continued to refine and improve it ever since. It is the process used in our newly released product, Insight Predict.

In general, all the various forms of TAR share common denominators: machine learning, sampling, subjective coding of documents, and refinement. But at the end of the day, the basic concept of TAR is simple, in that it must accomplish only two essential tasks:

  1. Finding all (or “proportionally all”) responsive documents.
  2. Verifying that all (or “proportionally all”) responsive documents have been found.

That is it. For short, let us call these two goals “finding” and “validating.”

Finding Responsive Documents

Finding consists of two parts:

  1. Locating and selecting documents to label. By “label,” we mean manually mark them as responsive or nonresponsive.
  2. Propagating (via an algorithmic inference engine) these labels onto unseen documents.

This process of finding or searching for responsive documents is typically evaluated using two quantitative measures: precision and recall. Precision is the number of true hits (actually responsive documents) returned by the search divided by the total number of hits returned. Recall is the number of true hits returned by the search divided by the actual number of true hits in the population.
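To make the two measures concrete, here is a minimal sketch in Python. The function name and the document IDs are made up for illustration; nothing here is specific to any TAR product.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for a set of retrieved document IDs.

    retrieved  -- IDs returned by the search
    responsive -- IDs that are actually responsive in the whole population
    """
    retrieved, responsive = set(retrieved), set(responsive)
    true_hits = len(retrieved & responsive)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical example: 8 documents returned, 6 of them responsive,
# out of 10 responsive documents in the population.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "n1", "n2"]
responsive = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
p, r = precision_recall(retrieved, responsive)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```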

One area of contention and disagreement among vendors is step 1, the sampling procedures used to train the algorithm in step 2. Vendors’ philosophies generally fall into one of two camps, which loosely can be described as “judgmentalists” and “randomists.”

The judgmentalist approach assumes that litigation counsel (or the review manager) has the most insightful knowledge about the domain and matter and is therefore going to be the most effective at choosing training documents. The randomist approach, on the other hand, is concerned about bias. Expertise can help the system quickly find certain pockets of responsive information, the randomists concede, but the problem they see is that even experts do not know what they do not know. By focusing the attention of the system on some documents and not others, the judgmental approach potentially ignores large swaths of responsive information even while it does exceptionally well at finding others.

Therefore, the random approach samples every document in the collection with equal probability. This even-handed approach mitigates the problem of human bias and ensures that a wide set of starting points are selected. However, there is still no guarantee that a simple random sample will find those known pockets of responsive information about which the human assessor has more intimate knowledge.

At Catalyst, we recognize merits in both approaches. An ideal process would be one that combines the strengths of each to overcome the weakness of the other. One straightforward solution is to take the “more is more” approach and do both judgmental and random sampling. A combined sample not only has the advantage of human expertise, but also avoids some of the issues of bias.

However, while it is important to avoid bias, simple random sampling misses the point. Random sampling is good for estimating counts; it does not do as well at guaranteeing topical coverage (sussing out all pockets). The best way to avoid bias is not to pick “random” documents, but to select documents about which you know that you know very little. Let’s call it “diverse topical coverage.”

Remember the difference between the two goals: finding vs. validating. For validation, a statistically valid random sample is required. But for finding, we can be more intelligent than that. We can use intelligent algorithms to explicitly detect which documents we know the least about, no matter which other documents we already know something about. This is more than just simple random sampling, which has no guarantee to topically cover a collection. This is using algorithms to explicitly seek out those documents about which we know nothing or next to nothing. The Catalyst approach is therefore to not stand in the way of our clients by shoehorning them into a single sampling regimen for the purpose of finding. Rather, our clients may pick whatever documents they want to judge, for whatever reason, and “contextual diversity sampling” will detect any imbalances and help select the rest.
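The general idea of seeking out the documents we know least about can be sketched in a few lines. This is not Catalyst’s algorithm, which is proprietary and not described here. It is a toy stand-in that assumes word-overlap (Jaccard) similarity as the measure of what is “known,” and the function names and sample documents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_least_known(docs, labeled_ids, k=1):
    """Pick k unlabeled documents least similar to anything already labeled.

    docs        -- dict mapping document ID to its text
    labeled_ids -- IDs a reviewer has already judged (chosen by any means)
    """
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    known = [tokens[i] for i in labeled_ids]
    candidates = [i for i in docs if i not in labeled_ids]
    chosen = []
    for _ in range(min(k, len(candidates))):
        # A candidate is "familiar" if it resembles any document we already know.
        def familiarity(doc_id):
            return max((jaccard(tokens[doc_id], t) for t in known), default=0.0)

        best = min(candidates, key=familiarity)
        chosen.append(best)
        candidates.remove(best)
        known.append(tokens[best])  # treat the pick as known for the next round
    return chosen


# Hypothetical collection: the reviewer judged only document "a".
docs = {
    "a": "quarterly pricing memo for pork bellies",
    "b": "pricing memo for pork bellies draft",
    "c": "source code review of competitor product",
    "d": "lunch order for friday",
}
print(pick_least_known(docs, ["a"], k=2))  # ['c', 'd']
```

Note that the starting labels can come from anywhere, judgmental or random. The selection step only looks at what is already covered and fills in the gaps.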

Examples of Finding

The following examples illustrate the performance of Catalyst’s intelligent algorithms with respect to the various points that were made in the previous section about random, judgmental, and contextual diversity sampling. In each of these examples, the horizontal x-axis represents the percentage of the collection that must be reviewed in order to find (on the y-axis) the given recall level using Catalyst’s Predictive Ranking algorithms.
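For readers who want to reproduce this kind of chart on their own data, here is a minimal sketch of how such a curve can be computed from a ranked list whose true labels are known. The function names and the ten-document ranking are illustrative assumptions, not output from Predictive Ranking.

```python
def recall_curve(ranked_labels):
    """Return (fraction_reviewed, recall) after each document in ranked order.

    ranked_labels -- list of booleans, best-ranked first; True means responsive
    """
    total = sum(ranked_labels)
    n = len(ranked_labels)
    found = 0
    points = []
    for i, is_responsive in enumerate(ranked_labels, start=1):
        found += is_responsive
        points.append((i / n, found / total if total else 0.0))
    return points


def depth_for_recall(ranked_labels, target):
    """Smallest fraction of the collection to review to reach the target recall."""
    for fraction, recall in recall_curve(ranked_labels):
        if recall >= target:
            return fraction
    return None


# Hypothetical ranking of 10 documents, 4 of which are responsive.
ranked = [True, True, False, True, False, False, True, False, False, False]
print(depth_for_recall(ranked, 0.75))  # 0.4
```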

For example, in this first graph we have a Predictive Ranking task with a significant number of responsive documents, a high richness. There are two lines, each representing a different initial seed condition: random versus judgmental. The first thing to note is that judgmental sampling starts slightly “ahead” of random sampling. The difference is not huge; the judgmental approach finds perhaps 2-3% more documents initially. That is to be expected, because the whole point of judgmental sampling is that the human can use his or her intelligence and insight into the case or domain to find documents that the computer is not capable of finding by strictly random sampling.

That brings us to the concern that judgmental sampling is biased and will not allow TAR algorithms to find all the documents. However, this chart shows that by using Catalyst’s intelligent iterative Predictive Ranking algorithms, both the judgmental and random initial sampling get to the same place. They both get about 80% of the available responsive documents after reviewing only 6% of the collection, 90% after reviewing about 12% of the collection, and so forth. Initial differences and biases are swallowed up by Catalyst’s intelligent Predictive Ranking algorithms.

In the second graph, we have a different matter in which the number of available responsive documents is over an order of magnitude less than in the previous example; the collection is very sparse. In this case, random sampling is not enough. A random sample does not find any responsive documents, so nothing can be learned by any algorithm. However, the judgmental sample does find a number of responsive documents, and even with this sparse matter, 85% of the available responsive documents may be found by only examining a little more than 6% of the collection.

However, a different story emerges when the user chooses to switch on contextual diversity sampling as part of the algorithmic learning process. In the previous example, contextual diversity was not needed. In this case, especially with the failure of the random sampling approach, it is. The following graph shows the results of both random sampling and judgmental sampling with contextual diversity activated, alongside the original results with no contextual diversity:

Adding contextual diversity to the judgmental seed has the effect of slowing learning in the initial phases. However, after only about 3.5% of the way through the collection, it catches up to the judgmental-only approach and even surpasses it. A 95% recall may be achieved a little less than 8% of the way through the collection. The results for adding contextual diversity to the random sampling are even more striking. It also catches up to judgmental sampling about 4% of the way through the collection and also surpasses it by the end, ending up at just over 90% recall a little less than 8% of the way through the collection.

These examples serve two primary purposes. First, they demonstrate that Catalyst’s iterative Predictive Ranking algorithms work, and work well. The vast majority of a collection does not need to be reviewed, because the Predictive Ranking algorithm finds 85%, 90%, 95% of all available responsive documents within only a few percent of the entire collection.

Second, these examples demonstrate that, no matter how you start, you will attain that good result. It is this second point that bears repeating and further consideration. Real-world e-discovery is messy. Collection is rolling. Deadlines are imminent. Experts are not always available when you need them to be available. It is not always feasible to start a TAR project in the clean, perfect, step-by-step manner that a vendor might require. Knowing that one can instead start either with judgmental samples or with random samples, and that the ability to add a contextual diversity option ensures that early shortcomings are not only mitigated but exceeded, is of critical importance to a TAR project.

Validating What You Have Found

Validating is an essential step in ensuring legal defensibility. There are multiple ways of doing it. Yes, there needs to be random sampling. Yes, it needs to be statistically valid. But there are different ways of structuring the random samples. The most common method is to do a simple random sample of the collection as a whole, and then another simple random sample of the documents that the machine has labeled as nonresponsive. If the richness of responsive documents in the latter sample has significantly decreased from the responsive-document richness in the initial whole population, then the process is considered to be valid.
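A rough sketch of that common two-sample comparison follows. The text above does not say which statistical test is used to decide that richness has “significantly decreased,” so the rule below (non-overlapping normal-approximation confidence intervals) is an assumption made for illustration, as are the sample sizes and counts.

```python
import math


def richness_estimate(sample_labels, z=1.96):
    """Estimate richness from a simple random sample of coded documents.

    Returns (point_estimate, low, high) using a normal-approximation interval.
    """
    n = len(sample_labels)
    p = sum(sample_labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


def significant_decrease(before, after):
    """Assumed rule: the intervals must not overlap, with 'after' lower."""
    return after[2] < before[1]


# Hypothetical samples of 400 documents each.
whole_collection = [True] * 40 + [False] * 360   # 10% coded responsive
machine_nonresponsive = [True] * 4 + [False] * 396  # 1% coded responsive

before = richness_estimate(whole_collection)
after = richness_estimate(machine_nonresponsive)
print(significant_decrease(before, after))  # True
```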

However, at Catalyst we use a different procedure, one that we think is better at validating results. Like other methods, it also relies on random sampling. However, instead of doing a simple random sample of a set of documents, we use a systematic random sample of a ranking of documents. Instead of labeling documents first and sampling for richness second, the Catalyst procedure ranks all documents by their likelihood of being responsive. Only then is a random sample—a systematic random sample—taken.

At equal intervals across the entire list, samples are drawn. This gives Catalyst the ability to better estimate the concentration of responsive documents at every point in the list than an approach based on unordered simple random sampling. With this better estimate, a smarter decision boundary can be drawn between the responsive and nonresponsive documents. In addition, because the documents on either side of that boundary have already been systematically sampled, there is no need for a two-stage sampling procedure.
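The mechanics of a systematic random sample over a ranking can be sketched as follows. This is a simplified illustration, not Catalyst’s implementation: the function names, the five-band summary, and the toy collection (where the first 150 ranked documents are taken to be responsive) are all assumptions.

```python
import random


def systematic_sample(ranked_ids, sample_size, seed=None):
    """Draw documents at equal intervals down a ranked list, from a random start."""
    n = len(ranked_ids)
    if not 0 < sample_size <= n:
        raise ValueError("sample_size must be between 1 and the collection size")
    interval = n // sample_size
    start = random.Random(seed).randrange(interval)
    positions = range(start, n, interval)[:sample_size]
    return [(pos, ranked_ids[pos]) for pos in positions]


def concentration_by_band(sample, is_responsive, n_docs, bands=5):
    """Share of sampled documents coded responsive within each band of the ranking."""
    hits = [0] * bands
    seen = [0] * bands
    for pos, doc_id in sample:
        band = min(pos * bands // n_docs, bands - 1)
        seen[band] += 1
        hits[band] += bool(is_responsive(doc_id))
    return [h / s if s else 0.0 for h, s in zip(hits, seen)]


# Hypothetical ranked collection of 1,000 documents. For the demo, a reviewer's
# coding is simulated by treating the first 150 ranked documents as responsive.
ranked_ids = list(range(1000))
sample = systematic_sample(ranked_ids, 50, seed=7)
concentration = concentration_by_band(sample, lambda doc_id: doc_id < 150, len(ranked_ids))
print(concentration)
```

Because every band of the ranking receives the same number of sampled documents, the estimate of where the responsive documents sit is equally well supported from top to bottom.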

Workflow: Putting Finding and Validating Together

In the previous section, we introduced the two primary tasks involved in TAR: finding and validation. If machines (and humans, for that matter) were perfect, there would be no need for these two stages. There would only be a need for a single stage. For example, if a machine algorithm were known to perfectly find every responsive document in the collection, there would be no need to validate the algorithm’s output. And if a validation process could perfectly detect when all documents are correctly labeled, there would be no need to use an algorithm to find all the responsive ones; all possible configurations (combinatorial issues aside) could be tested until the correct one is found.

But no perfect solutions exist for either task, nor will they in the future. Thus, the reason for having a two-stage TAR process is so that each stage can provide checks and balances to the other. Validation ensures that finding is working, and finding ensures that validation will succeed.

Therefore, TAR requires some combination of both tasks. The manner in which both finding and validation are symbiotically combined is known as the e-discovery “workflow.” Workflow is a non-standard process that varies from vendor to vendor. For the most part, every vendor’s technology combines these tasks in a way that, ultimately, is defensible. However, defensibility is the minimum bar that must be cleared.

Some combinations might work more efficiently than others. Some combinations might work more effectively than others. And some workflows allow for more flexibility to meet the challenges of real world e-discovery, such as rolling collection.

We’ll discuss a standard model, typical of the industry, then review Catalyst’s approach, and finally conclude with the reason Catalyst’s approach is better. Hint: It’s not (only) about effectiveness, although we will show that it is that. Rather, it is about flexibility, which is crucial in the work environments in which lawyers and review teams use this technology.

Standard TAR Workflow

Most TAR technologies follow the same essential workflow. As we will explain, this standard workflow suffers from two weaknesses when applied in the context of real-world litigation. Here are the steps it entails:

  1. Estimate via simple random sampling how many responsive and nonresponsive docs there are in the collection (aka estimate whole population richness).
  2. Sample (and manually, subjectively code) documents.
  3. Feed those documents to a predictive coding engine to label the remainder of the collection.
  4. If manual intervention is needed to assist in the labeling (for example via threshold or rank-cutoff setting), do so at this point.
  5. Estimate via simple random sampling how many responsive documents there are in the set of documents that have been labeled in steps 3 and 4 as nonresponsive.
  6. Compare the estimate in step 5 with the estimate in step 1. If there has been a significant decrease in responsive richness, then the process as a whole is valid.

TAR as a whole relies on these six steps working as a harmonious process. However, each step is not done for the same reason. Steps 2-4 are for the purpose of finding and labeling. Steps 1, 5, and 6 are for the purpose of validation.

The first potential weakness in this standard workflow stems from the fact that the validation step is split into two parts, one at the very beginning and one at the very end. It is the relative comparison between the beginning and the end that gives this simple random-sampling-based workflow its validity. However, that also means that in order to establish validity, no new documents may arrive at any point after the workflow has started. Collection must be finished.

In real-world settings, collection is rarely complete at the outset. If new documents arrive after the whole-population richness estimate (step 1) is already done, then that estimate will no longer be statistically valid. And if that initial estimate is no longer valid, then the final estimates (step 5), which compare themselves to that initial estimate, will also not be valid. Thus, the process falls apart.
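A small worked example shows how quickly the baseline can drift. The counts below are invented purely for illustration.

```python
def richness(responsive, total):
    """Fraction of a population that is responsive."""
    return responsive / total


# Hypothetical numbers, chosen only for illustration.
initial_total, initial_responsive = 100_000, 10_000  # population when step 1 ran
late_total, late_responsive = 50_000, 1_000          # documents collected afterwards

before = richness(initial_responsive, initial_total)
after = richness(initial_responsive + late_responsive, initial_total + late_total)
print(f"baseline richness {before:.1%}, actual richness after new arrivals {after:.1%}")
# baseline richness 10.0%, actual richness after new arrivals 7.3%
```

In this scenario the step 1 baseline overstates the richness of the collection that is actually being reviewed, so any later comparison against it measures the drop from the wrong starting point.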

The second potential weakness in the standard workflow is that the manual intervention for threshold setting (step 4) occurs before the second (and final) random sampling (step 5). This is crucial to the manner in which the standard workflow operates. In order to compare before and after richness estimates (step 1 vs. step 5), concrete decisions will have had to be made about labels and decision boundaries. But in real-world settings, it may be premature to make concrete decisions at this point in the overall review.

How Catalyst’s Workflow Differs

In order to circumvent these weaknesses and match our process more closely to real-world litigation, Catalyst’s Predictive Ranking uses a proprietary, four-step workflow:

  1. Sample (and manually, subjectively code) documents.
  2. Feed those documents to our Predictive Ranking engine to rank the remainder of the collection.
  3. Estimate via a systematic random sample the relative concentration of responsive documents throughout the ranking created in step 2.
  4. Based on the concentration estimate from step 3, select a threshold or rank-cutoff setting which gives the desired recall and/or precision.

Once again, as with the standard predictive coding workflow, our Predictive Ranking as a whole relies on these four steps working as a harmonious process. However, each step is not done for the same reason. Steps 1 and 2 are for the purpose of finding and labeling. Steps 3 and 4 are for the purpose of validation.
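Steps 3 and 4 can be sketched as follows: given a coded systematic sample, walk down the ranking and stop at the first sampled position where the estimated recall reaches the target. This is an illustrative simplification under assumed inputs, not the Predictive Ranking engine itself, and it reports point estimates only, with no confidence intervals.

```python
def cutoff_for_recall(coded_sample, target_recall):
    """Find the shallowest sampled rank that reaches the target estimated recall.

    coded_sample -- list of (rank_position, is_responsive) pairs from a
                    systematic sample of the ranking
    Returns (cutoff_position, estimated_recall, estimated_precision) or None.
    """
    coded = sorted(coded_sample)
    total_responsive = sum(1 for _, responsive in coded if responsive)
    if total_responsive == 0:
        return None
    found = 0
    for examined, (pos, responsive) in enumerate(coded, start=1):
        found += bool(responsive)
        recall = found / total_responsive
        if recall >= target_recall:
            return pos, recall, found / examined
    return None


# Hypothetical sample: one document drawn every 100 ranks from a 1,000-document ranking.
coded = [(5, True), (105, True), (205, True), (305, False), (405, True),
         (505, False), (605, False), (705, False), (805, False), (905, False)]
print(cutoff_for_recall(coded, 0.75))  # (205, 0.75, 1.0)
print(cutoff_for_recall(coded, 1.0))   # (405, 1.0, 0.8)
```

The same coded sample supports any number of candidate cutoffs, which is why the threshold decision can safely be deferred until after the sample is drawn.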

Two important points should be noted about Catalyst’s workflow. The first is that the validation step is not split into two parts. Validation only happens at the very end of the entire workflow. If more documents arrive while documents are being found and labeled during steps 1 and 2 (i.e. if collection is rolling), the addition of new documents does not interfere with anything critical to the validation of the process. (Additional documents might make finding more difficult; finding is a separate issue from validating, one which Catalyst’s contextual diversity sampling algorithms are designed to address.)

The fact that validation in our workflow is not hampered by collections that are fluid and dynamic is significant. In real-world e-discovery situations, rolling collection is the norm. Our ability to handle this fluidity natively—by which we mean central to the way the workflow normally works, rather than as a tacked-on exception—is highly valuable to lawyers and review teams.

The second important point to note about Catalyst’s workflow is that the manual intervention for threshold setting (step 4) happens after the systematic random sample. At first it may seem counterintuitive as to why this is defensible, because choices about the labeling of documents are happening after a random sample has been taken. But the purpose of the systematic random sample is to estimate concentrations in a statistically valid manner. Since the concentration estimates themselves are valid, decisions made based on those concentrations are also valid.

Consequences and Benefits of the Catalyst Workflow

We already touched on two key ways in which the Catalyst Predictive Ranking workflow differs from the industry standard workflow. It is important to understand what our workflow allows us, and you, to do:

  1. Get good results. Catalyst Predictive Ranking consistently demonstrates high scores for both precision and recall.
  2. Add more training samples, of any kind, at any time. That allows the flexibility of having judgmental samples without bias.
  3. Add more documents, of any kind, at any time. You don’t have to wait 158 days until all documents are collected. And you don’t have to repeat step 1 of the standard workflow when those additional documents arrive.
  4. Go through multiple stages of culling and filtering without hampering validation. In the standard workflow, that would destroy your baseline. This is not a concern with the Catalyst approach, which saves the validation to the very end, via the systematic sample.

Catalyst has more than four years of experience using Predictive Ranking techniques to target review and reduce document populations. Our algorithms are highly refined and highly effective. Even more important, however, is that our Predictive Ranking workflow has what other vendors’ workflows do not—the flexibility to accommodate real-world e-discovery. Out there in the trenches of litigation, e-discovery is a dynamic process. Whereas other vendors’ TAR workflows require a static collection, ours flows with the dynamics of your case.

In Search, Evaluation Drives Innovation; Or, What You Cannot Measure You Cannot Improve

Information retrieval researchers at Shonan last week. That's me in the center, wearing the yellow T-shirt.

Last week, I was honored to join a small group of information-retrieval researchers from around the world, from both industry and academia, who gathered at the Shonan Village Center in Kanagawa, Japan, to discuss issues surrounding the evaluation of whole session, interactive information retrieval. In this post, I introduce the purpose of this meeting. In later posts, I hope to further review the discussions that took place at Shonan and my own impressions.

Traditionally, information retrieval (a.k.a. search) has been viewed as a stateless, non-interactive process. The user issues a(n ad hoc) query to a search engine and the engine responds with its best attempt at answering that query, with results ranked by their likelihood of satisfying a user’s information need.

Interactive information retrieval, on the other hand, presumes multiple rounds of user-system exchange. The interactions during this exchange are presumed to be non-independent. Each query has some sort of relationship to previous queries, if only because the overall series is in support of the same user task or goal.

Examples of scenarios in which interactive information retrieval is necessary include travel or event planning, education and learning, seeking entertainment, and (of course) e-discovery. When queries are independent, the best the system can do is answer each query as if it were the last that the user will ask. However, when queries are non-independent, both the user and the system have the chance to engage in deeper and wider patterns of exploration.

Evaluating Interactive Information Retrieval

Evaluation of one-shot queries has a long and rich history. Concepts such as “binary relevance” and “precision and recall,” combined with batch mode evaluation, have led to countless advances in the state of the art. These advances, from the 1960s to the 1990s, allowed search engines, especially in a web context, to improve to the point at which they now bring huge benefits to society. Evaluation of interactive information retrieval tasks, on the other hand, does not have as-yet universally accepted metrics. The very nature of the interactivity (non-independence of a sequence of user actions and system responses) both gives the scenario its power and makes it difficult to evaluate.

The power, again, comes from the breadth and depth of what is made possible; the evaluation difficulty comes from this very same interdependence. When a single query is performed, it can be generally expected that the user traverses the results list in linear order, from estimated best to estimated worst result. And, with some probability, the user abandons the list traversal. These (generally realistic) assumptions allow the ad hoc, one-shot query to be evaluated in terms of the position of relevant documents within the list.
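One well-known metric built on exactly these assumptions is rank-biased precision, in which the user moves from each result to the next with a fixed “persistence” probability and otherwise abandons the list. The sketch below is offered as one example of a position-based metric, not as the metric any particular system uses; the persistence value of 0.8 is an arbitrary choice.

```python
def rank_biased_precision(relevance, persistence=0.8):
    """Score a ranked list under a linear-traversal, probabilistic-abandonment model.

    relevance   -- list of 0/1 judgments, best-ranked first
    persistence -- probability that the user continues to the next result
    """
    return (1 - persistence) * sum(
        rel * persistence ** i for i, rel in enumerate(relevance)
    )


# The same single relevant document is worth more at rank 1 than at rank 3.
top = rank_biased_precision([1, 0, 0])
third = rank_biased_precision([0, 0, 1])
print(f"{top:.3f} {third:.3f}")  # 0.200 0.128
```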

However, when multiple queries are performed, an element of non-determinism enters into the picture. A user typically does not examine all results in the list from the first query, then all results in the list from the second query, and so on. Instead, one user might only examine 57 results from the first query, 9 results from the second query, and then 82 results from the third query. Another user might examine 3 results from the first query, 18 results from the second query, and 17 results from the third query.

Furthermore, the order in which the results are seen by the user affects the next round of interactivity. That is, the second and third queries that are issued by an information seeker are influenced by which documents were seen during the first round of interaction. Even if two users started with the same first query, the user who looked at 57 results might have a very different notion of how to formulate the next query than the user who looked at only 3 results.

How, then, should these two users’ experiences with the interactive search engine be evaluated? Should it be the product or sum of the quality of the individual ranked lists for each query? That ignores the depth to which the user actually traveled in each list over the course of the session. Should evaluation instead be a function of the sequence of documents that the user actually saw during the course of the session, no matter which individual results list a document came from? That is better, but it still ignores the effects of document examination order on the queries that were issued — and more importantly on the queries that could have been issued, had the user traversed to either a shallower or deeper position within a particular list. The non-deterministic range of possibilities poses a severe challenge to the evaluation of interactive information retrieval.
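To see how much the traversal depth alone changes the picture, here is a toy version of the second option: scoring only the documents a user actually saw over the session. The result lists, depths, and relevance judgments are invented, and the measure (precision over unique documents seen) is just one simple choice. As noted above, it still ignores how examination order shapes later queries.

```python
def documents_seen(result_lists, depths):
    """Concatenate the portion of each ranked list that the user actually examined."""
    seen = []
    for results, depth in zip(result_lists, depths):
        seen.extend(results[:depth])
    return seen


def session_precision(result_lists, depths, relevant):
    """Precision over the unique documents seen during the whole session."""
    unique = list(dict.fromkeys(documents_seen(result_lists, depths)))
    if not unique:
        return 0.0
    return sum(doc in relevant for doc in unique) / len(unique)


# Hypothetical session: three queries, identical result lists for both users.
lists = [["a", "b", "c", "d"], ["e", "a", "f"], ["g", "h"]]
relevant = {"a", "f", "g"}

deep_then_shallow = session_precision(lists, (4, 1, 2), relevant)
shallow_then_deep = session_precision(lists, (1, 3, 1), relevant)
print(f"{deep_then_shallow:.2f} {shallow_then_deep:.2f}")  # 0.29 0.75
```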

Another issue related to whole-session evaluation in interactive information seeking has to do with progress during versus upon completion of an entire session. Should the primary focus of evaluation be to estimate the quality of a session only at the end of the user’s sequence of interactions? Or is it more important to have a metric which measures, i.e. expects, progress throughout a session? Inherent in the answer to this question is whether one expects interactive information retrieval progress to be linear. Is it? Should it be? The answer is an open question, one which we discussed at the Shonan Meeting.

Evaluation drives innovation. If you cannot measure something, you cannot improve it. The first step to improving interactive information retrieval systems is knowing what to measure and how to measure it. Only then will consistent improvements be possible.

Search Q&A: How ECA is ‘Broken’ and the Solution that will Fix It

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?

DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.

Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by completely removing 50% of the documents or more–prior to going into active document searching and review. This practice is often carried out by using metadata (such as date or author), keywords or concepts, and removing documents that contain certain “obviously” non-relevant terms.

In theory, the idea is fantastic. It greatly reduces the cost of both hosting and of reviewing. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar; they have primary and secondary indexes. Content most likely to be relevant and useful to their users gets fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary indexes are made smaller and faster, making the overall search process much better.

However, there is a key difference between ECA and the web engine practice of primary and secondary indexing. In ECA, there is no secondary index. Documents that have been judged non-relevant on the sole basis of a few keywords or concepts or metadata are simply removed from the process completely, never to be revisited. Therein lies the problem.

I am an information-retrieval research scientist. One of the core precepts in my field is that a document will be relevant for only a few, very specific reasons, but non-relevant for dozens if not hundreds of reasons. The corollary to this is that there are many more keywords and concepts found in non-relevant documents that are also found in relevant documents than vice versa. That is, there is a higher probability that a keyword or concept found in a non-relevant document will also be found in a relevant document.

So what does that mean for ECA? The problem arises if you are using keywords and concepts to filter out non-relevant documents without actually assessing them for relevance (i.e. without actually doing review). In that case, there is a strong danger that the keywords and concepts you are using to do the filtering are also removing a number of relevant documents. And because you’re not doing what the web search engines do (creating a secondary index that can be revisited at a later point in time) but instead are completely removing those ECA’d documents from all further search and review, you’re losing those relevant documents forever.

When a Slam Dunk is a Smoking Gun

For example, one might be tempted to use ECA tools to filter out all documents that contain the terms “football,” “touchdown,” “49ers,” “Lakers,” “slam dunk,” “foul shot,” etc. Clearly these are all sports references and (let’s presume) sports emails are not relevant to the matter at hand but rather part of background office chatter. However, suppose the collection contains an email that says, “Cindy, I was able to reverse engineer competitor X’s code. I think this should make our new product offering a total slam dunk!” Or there might be another email that says, “Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers game after work on Monday? We can discuss our plans to fix the price of pork bellies.”

If the terms “49ers” and “slam dunk” have already been used during the ECA phase to completely remove every document that contains them, then these critical documents will be completely missed, putting the litigant at severe risk.

The solution, therefore, is to employ ECA in a manner that does not completely obliterate documents. Instead, ECA should be a tool for shifting certain sets of documents to a lower retrieval priority, a lower review priority or a secondary index. All of the documents should still be available. ECA simply helps with an intelligent prioritization of the searching and reviewing of those documents.

This approach allows the primary review to continue on as usual, with all the advantages of a pre-culled smaller number of documents. But if certain terms get discovered as part of that primary review process–terms such as “reverse engineer” or “pork bellies”–those terms can be used as queries into the secondary index. Then, the documents talking about meeting at the brew pub to watch the 49ers game and discuss the price fixing of pork bellies can still be recovered, despite having been pre-culled at an early stage. At the same time, if those ECA’d documents don’t contain “pork bellies,” they still remain in the secondary index and do not disrupt the efficiency and effectiveness of the primary index. It is the best of both worlds.

In short, the problem with ECA today is that it draws hard boundaries–it makes permanent decisions about documents when it really shouldn’t. The solution is to make those boundaries softer, to treat ECA as a prioritization tool, or as a mechanism for shifting documents into tiered secondary and even tertiary indexes. In that manner, poor decisions made early on in the process, under the blindness of an ECA process, are not made permanent. They can be easily, automatically and effectively corrected.
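The soft-boundary idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TieredIndex` class, its `cull_terms` parameter and the naive substring matching are all hypothetical stand-ins, not a description of any real ECA product or search engine.

```python
from dataclasses import dataclass, field


@dataclass
class TieredIndex:
    """Toy two-tier index: culled documents are demoted, never deleted."""
    cull_terms: set[str]
    primary: dict[str, str] = field(default_factory=dict)
    secondary: dict[str, str] = field(default_factory=dict)

    def add(self, doc_id: str, text: str) -> None:
        # A document matching any cull term goes to the secondary tier
        # instead of being removed from the process.
        lowered = text.lower()
        culled = any(term in lowered for term in self.cull_terms)
        (self.secondary if culled else self.primary)[doc_id] = text

    def search(self, query: str, include_secondary: bool = False) -> list[str]:
        # Naive substring match stands in for a real search engine.
        q = query.lower()
        tiers = [self.primary] + ([self.secondary] if include_secondary else [])
        return [doc_id for tier in tiers
                for doc_id, text in tier.items() if q in text.lower()]


index = TieredIndex(cull_terms={"49ers", "slam dunk"})
index.add("d1", "Quarterly pork bellies pricing memo")
index.add("d2", "Catch the 49ers game Monday? We can discuss our plans "
                "to fix the price of pork bellies.")
index.add("d3", "Did you see that slam dunk last night?")

primary_hits = index.search("pork bellies")
all_hits = index.search("pork bellies", include_secondary=True)
```

Here the price-fixing email (`d2`) is culled from the primary tier because it mentions the 49ers, yet it is still recovered once "pork bellies" surfaces during review and is used to query the secondary tier. The pure sports chatter (`d3`) stays out of the way.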

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).

By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.

But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
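As a rough sketch of how such a signal might be measured, the Python below computes how often the "17th word is capitalized" signal fires among responsive versus non-responsive documents. The function names and the tiny example collection are invented for illustration; a real system would track many signals and weight them inside the ranking algorithm.

```python
from typing import Callable


def seventeenth_word_capitalized(text: str) -> bool:
    """Signal: is the 17th whitespace-delimited word capitalized?"""
    words = text.split()
    return len(words) >= 17 and words[16][:1].isupper()


def signal_rate(docs: list[str], signal: Callable[[str], bool]) -> float:
    """Fraction of documents in which the signal fires."""
    return sum(signal(d) for d in docs) / len(docs) if docs else 0.0


# Hypothetical judged documents: 16 filler words, then the 17th word.
prefix = " ".join(["w"] * 16)
responsive = [prefix + " Acme contract", prefix + " Zeta terms", prefix + " lunch plans"]
non_responsive = [prefix + " lunch today", "short note"]

responsive_rate = signal_rate(responsive, seventeenth_word_capitalized)
non_responsive_rate = signal_rate(non_responsive, seventeenth_word_capitalized)

# A large gap between the two rates suggests the signal is worth amplifying.
gap = responsive_rate - non_responsive_rate
```

A signal that fires about equally often in both groups carries no information. One with a large gap, like the 30-of-35 example above, is a candidate for boosting.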

Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.

Search Q&A: The Six Blind Men and the E-Discovery Elephant

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of search algorithms out there. Why do you feel that collaboration is a better way to search?

DR. JEREMY PICKENS: Collaboration is a better way to search because e-discovery is not all about the algorithms. Algorithms also involve people.

In a previous post (Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?), I talked about why there will never be a magic bullet for e-discovery. That primarily has to do with the fact that an information need is typically never satisfied with just a single document, as it often is in web search. Rather, in e-discovery, hundreds and thousands of responsive documents must be found.

When there is that much information, it can be quite beneficial to have more than one person’s viewpoint. Every query is a different hypothesis about what is relevant, a different probe into the collection. More people working together means more viewpoints, which translate into a wider variety of probes.

An algorithm that is multi-searcher aware and tries to reconcile (look for both similarities among and gaps between) the various searcher activities is going to do a better job than an algorithm that only comes at the problem from one viewpoint.

Think of it with reference to that old story of the six blind men who wanted to know what an elephant looked like. The first man touched the elephant’s leg and declared, “The elephant is a pillar.” The second touched its tail and described it as like a rope. The third felt the trunk and said it was like a tree branch. The fourth felt the ear and thought it was like a big fan. The fifth touched the belly and asserted it was a thick wall. The sixth felt the tusk and contended the elephant was like a solid pipe.

Seeing that the blind men could not agree on what the elephant looked like, a passing wise man explained, “All of you are right. The reason every one of you is telling it differently is because each one of you touched a different part of the elephant. Actually, the elephant has all the features each of you found.”

In a sense, e-discovery search is like those blind men’s search of an elephant. Provided the searchers work collaboratively, then as each searcher touches and interprets a part, eventually the whole elephant emerges. In search, therein lies the benefit of collaboration.

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.
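One simple way to picture "jointly and simultaneously" is to score each listing by its least-satisfied searcher, so that a place which delights one partner and ignores the other ranks low. The Python below is a toy sketch of that one idea. The scoring rule, function names and listings are assumptions made for illustration, not a description of any particular collaborative search system.

```python
def satisfaction(listing: set[str], wants: set[str]) -> float:
    """Fraction of one searcher's criteria that a listing meets."""
    return len(listing & wants) / len(wants)


def joint_rank(listings: dict[str, set[str]], *searchers: set[str]) -> list[str]:
    """Rank listings by the least-satisfied searcher, best first."""
    return sorted(
        listings,
        key=lambda name: min(satisfaction(listings[name], wants) for wants in searchers),
        reverse=True,
    )


listings = {
    "house_a": {"southern exposure", "upstairs kitchen"},
    "house_b": {"garden", "workshop"},
    "house_c": {"southern exposure", "upstairs kitchen", "garden", "workshop"},
    "house_d": {"southern exposure", "garden"},
}
you = {"southern exposure", "upstairs kitchen"}
partner = {"garden", "workshop"}

ranking = joint_rank(listings, you, partner)
```

The house that meets everything both people asked for comes first, and the compromise that half-satisfies each comes second. The two houses that fully satisfy only one partner fall to the bottom.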

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL). This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm). The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.

Hot on the list of topics, of course, was predictive coding.  Much of the discussion centered around determining exactly what standards were needed not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology may be safely applied to a matter at hand while still meeting all the legal requirements of discovery.

So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself.  Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process?  Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”

So I read the patent, and I read Recommind’s explanation, and I read the commentary, including Barry Murphy’s post, Dawn of the Predictive Coding Wars. First, from Murphy’s commentary:

According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on coattails.

Second, from the explanation:

Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …

The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan.  But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part.  Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.

OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology, so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow and documentation really ever explained. Great pain was taken to say what predictive coding was not (e.g. threading, clustering, etc. – which I agree with).   But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.

For that, I had to turn to the patent itself. See Figure 5 in the patent (above), labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim).  That claim says that the patent covers a method for analyzing a plurality of documents, comprising:

(1) Receiving the plurality of documents via a computing device

(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]

(3) Executing instructions stored in memory, that:

(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset

(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category

(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category

(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation

(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents

(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis

(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and

(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents

So that appears to be the primary workflow, the primary patented claim.  Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.

In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does. It starts with a collection of documents. It selects a subset of those documents in some manner. It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself so as to better align itself with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevant or nonrelevant, responsive or nonresponsive) for all the results. Finally, it should be noted that the traditional, decades-old relevance feedback process workflow also is capable of iteration.
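For readers who want to see one round of that loop in code, here is a minimal Python sketch using the Rocchio update, a classic relevance feedback method. It is offered only as one well-known instance of the general workflow, not as what any vendor's product does. The bag-of-words vectors, the weights `alpha`, `beta` and `gamma`, and the three example documents are all illustrative assumptions.

```python
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def rocchio(query: Counter, relevant: list[Counter], nonrelevant: list[Counter],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> dict[str, float]:
    """Move the query toward the relevant centroid and away from the nonrelevant one."""
    updated: dict[str, float] = {term: alpha * w for term, w in query.items()}
    for docs, weight in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, count in doc.items():
                updated[term] = updated.get(term, 0.0) + weight * count / len(docs)
    # Negative term weights are conventionally clipped to zero.
    return {term: w for term, w in updated.items() if w > 0}


def score(query: dict[str, float], doc: Counter) -> float:
    """Unnormalized dot product of query weights and document term counts."""
    return sum(w * doc[term] for term, w in query.items())


docs = {
    "d1": vectorize("pork bellies price agreement"),
    "d2": vectorize("pork recipes for dinner"),
    "d3": vectorize("price agreement with competitor"),
}
q0 = vectorize("pork")  # the user's short, simple query

# The user marks d1 relevant and d2 nonrelevant; the system revises the query.
q1 = rocchio(q0, relevant=[docs["d1"]], nonrelevant=[docs["d2"]])

before = score(q0, docs["d3"])
after = score(q1, docs["d3"])
```

The original query cannot see `d3` at all, since it never mentions pork. After one round of feedback the revised query has picked up "price" and "agreement" from the relevant document, and `d3` now scores above zero.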

So what is the difference? I don’t just ask this rhetorically. I see a very strong similarity in the overall workflows between both predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be–and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies, a goal in which I fully agree with Recommind–then we really need to understand the process as a whole and what makes it unique.

The only thing I can think of is that there are a few occasions in the claimed predictive coding workflow that integrate random sampling and this is most likely to ensure that the process is defensible. If that is the case, then how does that differ from active learning? Here is an example of the active learning workflow which incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:

Input:
    seed labelled data L, unlabelled data U, batch size b

Initialization:
    Train a model on L

Active Learning Loop:
    Until a stopping criterion is satisfied:
        Apply the trained model classifier on U
        Rank the instances in U using the uncertainty of the model
        Annotate the top b instances and add them to L
        Train the model on the expanded L

That is, instead of just presenting the expert user (e.g. lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue they’ve been coded for, an active learning process or workflow explicitly seeks to add those document instances about which the learning algorithm is the most uncertain. That could mean documents for which the probability of that document’s label is relatively even or undistinguished (highest entropy) across all classes (in the case of generative machine learning models) or documents which lie the nearest to a decision boundary (in the case of discriminative machine learning models).

However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way–nay even the only way–of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of.  Thus, active learning as a general workflow pattern also incorporates random sampling.
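The active learning loop described above can be made concrete with a deliberately tiny Python sketch. Everything in it is a simplifying assumption: documents are reduced to a single number, the "model" is just the midpoint between the two class means, a fixed number of rounds stands in for a real stopping criterion, and the `oracle` function plays the part of the human reviewer. It does not reproduce the stopping criterion from the Vlachos paper.

```python
import math
from typing import Callable


def train(labelled: dict[float, int]) -> float:
    """Toy 1-D model: boundary is the midpoint between class means.

    Assumes the labelled set holds at least one example of each class.
    """
    pos = [x for x, y in labelled.items() if y == 1]
    neg = [x for x, y in labelled.items() if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def prob_positive(boundary: float, x: float) -> float:
    """Logistic squashing of the signed distance to the boundary."""
    return 1 / (1 + math.exp(-(x - boundary)))


def active_learning(seed: dict[float, int], pool: list[float],
                    oracle: Callable[[float], int],
                    batch_size: int, rounds: int) -> tuple[float, dict[float, int]]:
    labelled, unlabelled = dict(seed), list(pool)
    boundary = train(labelled)
    for _ in range(rounds):  # fixed round count stands in for a stopping criterion
        if not unlabelled:
            break
        # Rank by uncertainty: probability closest to 0.5 comes first.
        unlabelled.sort(key=lambda x: abs(prob_positive(boundary, x) - 0.5))
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        for x in batch:
            labelled[x] = oracle(x)  # ask the expert for a label
        boundary = train(labelled)   # retrain on the expanded labelled set
    return boundary, labelled


# Hypothetical setup: items scoring 6 or more are truly "responsive".
seed = {0.0: 0, 9.4: 1}
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
boundary, labelled = active_learning(seed, pool, oracle=lambda x: int(x >= 6),
                                     batch_size=2, rounds=2)
```

With only four expert judgments, all requested near the current boundary, the learned boundary lands between 5 and 6, which is where the true one lies. Note that this toy always has a probability estimate for every item. The random-sampling case discussed above, where the model has no useful estimate at all, would need a random draw from the pool added to each batch.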

So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are.  That isn’t to say that they don’t exist.  However, I believe further discussion is warranted, both in public as well as at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole.  That is, I agree with Barry Murphy over at eDiscovery Journal that:

No matter what, this is good news for the eDiscovery market as a whole.  One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.

I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.

Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Information retrieval is a discipline from the 1970s. Relational databases arrived in the 1960s. Most e-discovery platforms combine full text search (from information retrieval) and a relational database. What do you think is new and exciting in the world of e-discovery with tools that are 40 and 50 years old? Do you think there is a magic algorithm that will be used in e-discovery that will be as disruptive as Google PageRank was for broad Internet searching?

JEREMY PICKENS: There are a number of different angles from which one could approach this question. Recall from a previous blog post that one of the primary distinguishing factors between web search and e-discovery search is that the former is geared toward finding the one best answer, such as a factoid or a home page (precision-oriented), whereas the latter typically requires thousands if not millions of relevant (responsive) documents in order to satisfy an information need. This difference is not insignificant; it changes the entire nature of the search system being designed to meet that need.

Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, e.g. “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchortext, that web page will be boosted in the rankings. That web page will be “voted” to the top.

More to the point: The specific algorithm that is used to count those votes is not as important as simply having the votes in the first place. Having the votes is what moves your page from page 57 of the results to page 1. A better algorithm might move the page to rank #2 on page 1, rather than rank #9 on page 1. But 90% of what got that document to page 1 was the votes themselves, rather than the mathematics of how the votes were counted. And simply being on page 1 accounts for 90% of the success of PageRank, as typical web searchers will only look at the first page of results and almost never further.

In summary, it is not so much the PageRank algorithm (mathematics) that makes PageRank so successful. It is the signal (link “votes”) used as input to the algorithm; the signal correlates well with the ultimate user goal.
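To illustrate the point about signal versus mathematics, the Python sketch below ranks a tiny, made-up link graph two ways: by simply counting inbound links, and by a basic power-iteration PageRank. The graph, the damping factor of 0.85 and the iteration count are assumptions chosen for the example. The simplified PageRank also assumes every page has at least one outbound link and that every link target appears in the graph.

```python
def inbound_votes(links: dict[str, list[str]]) -> dict[str, int]:
    """Crude vote count: number of inbound links per page."""
    votes = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            votes[target] += 1
    return votes


def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 100) -> dict[str, float]:
    """Basic power iteration; assumes no page lacks outbound links."""
    n = len(links)
    rank = {page: 1 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in links}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its current rank evenly among its links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages "vote" for hub.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]}

votes = inbound_votes(links)
ranks = pagerank(links)

top_by_votes = max(votes, key=votes.get)
top_by_pagerank = max(ranks, key=ranks.get)
```

On this toy graph the crude count and the more refined mathematics agree on which page belongs at the top. That will not hold on every graph, but it shows how much of the work the link signal itself is doing.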

So the question is whether there will ever be a magic algorithm for e-discovery that will be as disruptive as PageRank. This is the same as asking whether there will ever be a single signal (such as a link “vote”) that correlates well with the user goal or intention. At the risk of making too bold of a claim, I think that the answer is no.

An e-discovery searcher’s information need simply does not fit the “magic bullet” profile. Someone engaged in e-discovery does not look at the first page of results and stop. That person (or a team of reviewers) may look at 20 pages. Or 100 pages. So whether one of the many available relevant documents is on page 1 or on page 57 matters much less. The user information need does not match what PageRank — or PageRank-like magic bullet algorithms — is trying to do.

Magic bullet algorithms try to get the single best result (or a small handful of results) to the very top of the list. E-discovery users need thousands or millions of relevant results. And when there is that much information to find, exhaustively finding everything will take a wide diversity of signals and coordination among dozens of different algorithms.
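
The difference between the two information needs can be sketched with two standard measures: precision in the top 10 results (what a web searcher experiences) and recall after reviewing a much deeper slice of the list (closer to what an e-discovery team cares about). The 1,000-document ranking below is entirely hypothetical, as is the choice of a 500-document review depth.

```python
def precision_at(ranking, k):
    """Fraction of the top k results that are responsive."""
    top = ranking[:k]
    return sum(top) / len(top)


def recall_at(ranking, k):
    """Fraction of all responsive documents found within the top k."""
    return sum(ranking[:k]) / sum(ranking)


# Hypothetical ranking: 1 marks a responsive document, 0 a non-responsive one.
# After the first 10 positions, every fifth document is responsive.
rest = [1 if i % 5 == 0 else 0 for i in range(990)]
baseline = [0] * 10 + rest

# A "magic bullet" boost: lift one responsive document to rank 1.
boosted = [1] + [0] * 10 + rest[1:]

for name, ranking in (("baseline", baseline), ("boosted", boosted)):
    print(
        name,
        "precision@10 =", precision_at(ranking, 10),
        "recall@500 =", round(recall_at(ranking, 500), 3),
    )
```

Lifting a single document to the top changes what the first page looks like, but recall at a review depth of 500 is identical for the two rankings. Moving that number requires improving the ordering of many documents at once.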

Please note, however, that this does not mean algorithmic approaches will not work for e-discovery. Quite the contrary; e-discovery is in need of more, better and smarter algorithms. And these algorithms will improve our ability and capacity to meet the e-discovery challenge. It is just that the algorithms developed will not be “magic bullet” algorithms. They will be like a well-coordinated orchestra, with dozens of components playing in concert.

(Image: Felipe Micaroni Lalli per Creative Commons.)

Search Q&A: How to Evaluate the Quality of an E-Discovery Search Platform

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Since e-discovery is already costly and time consuming, there doesn’t seem to be a good way for customers to compare offerings by running a case in different systems. Besides sales slicks, acronyms and generic testing such as TREC, how do you think customers should evaluate the quality of the platform they have chosen to handle e-discovery?

JEREMY PICKENS: This is a good question, and one to which there is no single, easy answer. That said, one possibility would be for the platform itself to give you internal metrics in the form of goal-oriented progress prediction. For example, if your goal is to find all responsive or privileged documents in a collection, a good platform should not only give you an estimate of how many more responsive documents it thinks are available to be found, but also let you track the history of that prediction before and after various events.

One should be able to get a sense of how right or wrong that prediction was, as one’s session-based information-seeking task progresses.

Specifically, if that estimate changes drastically after the execution of a new query, or after the responsiveness coding of a particular set of documents, that should be brought to the user’s attention. In a quality platform, it is less important that the platform get the prediction right at the very beginning than that it be forthcoming and transparent about its mistakes. This should allow the user to work concertedly with the platform toward the goal.
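
As a rough illustration of what such a prediction history might look like, here is a minimal sketch. The `EstimateHistory` class, the 25% threshold, the event labels and the estimate values are all hypothetical; this is not how Catalyst or any other platform actually implements progress prediction.

```python
from dataclasses import dataclass, field


@dataclass
class EstimateHistory:
    """Tracks an estimate of remaining responsive documents across events."""

    threshold: float = 0.25  # hypothetical: flag relative changes above 25%
    entries: list = field(default_factory=list)

    def record(self, event, estimate):
        """Store a new estimate; return True if it shifted drastically."""
        flagged = False
        if self.entries:
            previous = self.entries[-1][1]
            # Guard against dividing by zero when the prior estimate was 0.
            base = max(abs(previous), 1)
            flagged = abs(estimate - previous) / base > self.threshold
        self.entries.append((event, estimate, flagged))
        return flagged

    def flagged_events(self):
        """Events after which the estimate changed by more than the threshold."""
        return [event for event, _, flagged in self.entries if flagged]


# Made-up session: the estimate is re-recorded after each event.
history = EstimateHistory()
history.record("initial sample coded", 12000)
history.record("review batch 1 coded", 11500)
history.record("new query run", 19000)
history.record("review batch 2 coded", 18200)

print("Events to bring to the user's attention:", history.flagged_events())
```

In this made-up session only the new query moves the estimate by more than the threshold, so that is the event the platform would surface to the user. The earlier estimates stay in the history so the user can see how right or wrong they were.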

Note: Jeremy Pickens, Bruce Kiefer and John Tredennick have written a research paper that expands on this topic, Process Evaluation in eDiscovery as Awareness of Alternatives. Pickens will present the paper June 6 at the ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings (DESI IV).