Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Best Practices in Predictive Coding: When are Pre-Culling and Keyword Searching Defensible?

Predictive coding is an effective e-discovery tool for ranking large sets of documents. However, it is commonly performed in a manner that may be severely under-inclusive–and therefore raise concerns about its defensibility.

In the use of predictive coding, it is a common practice for the producing party to run keyword searches first, and then sample and rank the resulting documents.  The documents that don’t hit on the searches are culled out before reaching the predictive coding process.

The reasons for doing it this way are:

  • Keyword searching is an accepted standard in e-discovery.
  • The client can avoid the per-document cost for the predictive coding software.
  • It reduces the number of documents that need to be reviewed and produced, which reduces time, cost, and risk.

However, this approach ignores the “dirty little secret” of e-discovery search—that keyword searches leave behind a large set of responsive/relevant documents.

Rank ALL (or Most) Documents, Not Just the Hits

The landmark e-discovery study about keyword searching was published in 1985 by Blair and Maron. David C. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 COMMUNICATIONS OF THE ACM 289, 295 (1985). In that study, the attorneys were confident that their searches had found more than 75% of the responsive documents. But they were wrong. In fact, the searches had found only 20% of the relevant documents.

This was the only study on the subject for many years. Keyword searching nevertheless became the accepted practice, largely because it was the best approach available. More recently, however, studies by TREC and others have shown that Blair and Maron were right. TREC’s 2008 study found that keyword searching returned an average of just 24% of responsive documents. In its 2007 study, the result was 22%. Other studies returned even less.

While sampling the leave-behinds and iterative searching by adding terms discovered during the review can solve part of the problem, plenty of documents will be omitted from the review and production. 
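To make "sampling the leave-behinds" concrete, here is a minimal sketch of how one might estimate what the keyword searches left behind. It is an illustration under stated assumptions, not a prescribed protocol: the function names, the normal-approximation interval, and every number in the example are hypothetical.

```python
import math

def estimate_left_behind(null_set_size, sample_size, responsive_in_sample, z=1.96):
    """Estimate responsive documents remaining in the culled (non-hit) set.

    Assumes a simple random sample of the non-hit documents was reviewed.
    Uses a normal-approximation interval on the sample proportion
    (z=1.96 corresponds to roughly 95% confidence).
    """
    rate = responsive_in_sample / sample_size
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    low = max(0.0, rate - margin)
    high = min(1.0, rate + margin)
    return {
        "rate": rate,
        "estimate": rate * null_set_size,
        "low": low * null_set_size,
        "high": high * null_set_size,
    }

def recall(found_by_search, estimated_left_behind):
    """Share of all responsive documents that the keyword hits captured."""
    return found_by_search / (found_by_search + estimated_left_behind)

# Hypothetical matter: 900,000 documents did not hit on any keyword.
# A random sample of 1,500 of them is reviewed; 30 turn out to be responsive.
result = estimate_left_behind(null_set_size=900_000,
                              sample_size=1_500,
                              responsive_in_sample=30)
print(result)

# If review of the keyword hits found 6,000 responsive documents,
# the estimated recall of the keyword searches follows directly.
print(recall(found_by_search=6_000, estimated_left_behind=result["estimate"]))
```

With these made-up numbers, a 2% responsive rate in the leave-behinds translates to about 18,000 missed documents and a keyword recall of roughly 25%. Even a low rate in the culled set can hide most of the responsive material when that set is large.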

This is the reason that going through the predictive coding process against ALL documents, rather than just the keyword search hits, is generally the most defensible practice.  Predictive coding based on sampling ALL the documents will find documents that the keyword searches miss.

How to use Keyword Searching with Predictive Coding

Does this mean that you don’t need keyword searching and other pre-culling techniques?  Of course not. Here are some of the ways we use keyword searching in conjunction with predictive coding:

  • Junk removal. We always analyze the documents for “junk” that can obviously be removed.  For example, in the Enron collection, it certainly makes sense to cull out the fantasy football documents first.  In a securities case, there are usually huge numbers of email market letters and irrelevant stock recommendations that can defensibly be removed.
  • Boosting “richness.” In most cases, the predictive coding software won’t work as well if the ratio of relevant documents to irrelevant documents (“richness percent”) is too low, meaning that the relevant documents are “sparse.”  It is legitimate and defensible to use keyword searching to boost the “richness” to a reasonable ratio before starting the predictive coding exercise.  Note, however, that it may make sense later to have the software rank the documents that didn’t hit on the keyword searches, as if it were a later rolling upload, to see if the software finds additional relevant documents.
  • Targeted searches. Often certain terms (or combinations of terms) will serve as a “rifle shot” to find important documents.  For example, in a patent case, a search for a technical term, such as a chemical name, may be important, especially when paired with the name of the inventor or the opposing party.  Targeted searching can be used at the beginning for sampling to find seed documents.  And, of course, it should be used throughout the review to find “rifle shot” documents and documents on sparsely populated issues that do not lend themselves to predictive coding. 
  • Metadata. Many predictive coding applications only analyze text and not metadata.  Obviously, there are many times when searching metadata is a key to finding relevant documents or filtering out irrelevant ones.  For example, in finding privileged communications, it helps to look in the TO and FROM fields to see if there are attorneys and clients in them.  Similarly, date filters are critical in filtering out documents that may contain “relevant” terms but are irrelevant to the issues in the case because of the time frame. 
  • Sampling and discrepancy analysis. While predictive coding applications typically include sampling methodologies, it is nevertheless a good idea to do additional sampling outside of the application, which can be done by searching for documents likely to be relevant. In particular, the discrepancy analysis, which compares software predictions with actual coding, will find documents that were given a low rank by the software.  Once that happens, you can go search for similar documents using “More Like This” and keyword searching. 
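The discrepancy analysis described in the last bullet can be sketched in a few lines. This is only an illustration: the `ReviewedDoc` record, the 0-to-1 score scale, the 0.5 cutoff and the sample data are all assumptions, and real predictive coding tools expose their rankings in their own ways.

```python
from dataclasses import dataclass

@dataclass
class ReviewedDoc:
    doc_id: str
    predicted_score: float  # software's relevance score, assumed to lie in [0, 1]
    coded_relevant: bool    # what the human reviewer actually decided

def discrepancies(docs, threshold=0.5):
    """Split reviewed documents into the two kinds of disagreement.

    missed:     coded relevant by a reviewer but ranked low by the software
    overcalled: coded irrelevant by a reviewer but ranked high by the software
    """
    missed = [d for d in docs if d.coded_relevant and d.predicted_score < threshold]
    overcalled = [d for d in docs if not d.coded_relevant and d.predicted_score >= threshold]
    # Put the most surprising disagreements first.
    missed.sort(key=lambda d: d.predicted_score)
    overcalled.sort(key=lambda d: d.predicted_score, reverse=True)
    return missed, overcalled

# Hypothetical review results.
reviewed = [
    ReviewedDoc("DOC-001", 0.92, True),
    ReviewedDoc("DOC-002", 0.08, True),
    ReviewedDoc("DOC-003", 0.15, False),
    ReviewedDoc("DOC-004", 0.81, False),
    ReviewedDoc("DOC-005", 0.31, True),
]

missed, overcalled = discrepancies(reviewed)
for doc in missed:
    print("relevant but ranked low:", doc.doc_id, doc.predicted_score)
```

The "missed" list is the natural starting point for the follow-up searching described above: each document on it is a candidate for a “More Like This” or keyword search.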

The Bottom Line

If you run predictive coding with ALL the documents, and not just those that hit on keyword searches, you will find more relevant documents, so the process is more defensible.  But keyword searching and searching metadata are still critical tools in the e-discovery toolbox.

 

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.
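One way to picture the idea is the toy sketch below. It is not a description of any particular system: the multiplication rule for combining scores, the function name and the sample scores are all assumptions chosen for illustration. The two properties it tries to capture come from the discussion above: documents already judged by anyone on the team are skipped, and a document ranks highly only if it satisfies every team member's criteria at once.

```python
def collaborative_rank(scores_by_member, already_judged):
    """Rank unjudged documents by how well they satisfy every team member.

    scores_by_member: {member_name: {doc_id: score in [0, 1]}}
    already_judged:   doc_ids that any team member has already reviewed

    Scores are combined by multiplication (an assumed rule), so a document
    that one member scores near zero cannot rank highly overall.
    """
    all_docs = set()
    for scores in scores_by_member.values():
        all_docs.update(scores)

    ranked = []
    for doc_id in all_docs - set(already_judged):
        joint = 1.0
        for scores in scores_by_member.values():
            joint *= scores.get(doc_id, 0.0)  # missing score counts as zero
        ranked.append((doc_id, joint))

    # Highest joint score first; doc_id breaks ties deterministically.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked

# Hypothetical scores from two team members' searches.
scores = {
    "lead_counsel":   {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.8},
    "review_manager": {"A": 0.8, "B": 0.9, "C": 0.6, "D": 0.1},
}
ranking = collaborative_rank(scores, already_judged={"A"})
print([doc_id for doc_id, _ in ranking])
```

In this example document A is excluded because someone has already judged it, and C comes out ahead of B and D because it is the only remaining document that both team members score reasonably well.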

The Recommind Patent: Reactions Roll In From Across the Industry

After Recommind announced June 8 that it had obtained a patent on predictive coding, the news rapidly rebounded throughout the e-discovery industry. In a Law Technology News article published the same day, Recommind Intends to Flex Predictive Coding Muscles, reporter Evan Koblentz quoted Craig Carpenter, Recommind’s general counsel and vice president of marketing, as saying that the company would “seek to license the patents to other companies that already offer their own versions of predictive coding or that want to have the ability.”

Koblentz also spoke to Catalyst’s CEO, John Tredennick, who said, “We’re puzzled that you can get a patent on what seems to be 40 years in the making in the academic community.” The next day, in a post at the blog Above the Law, John’s response and those of other Recommind competitors were characterized as jealous and grumpy.

Call it grumpiness if you will, but others in the e-discovery industry continue to weigh in on the patent with comments that are every bit as skeptical. Here at the Catalyst blog, Tredennick wrote a more-detailed explanation of his position, Predictive Coding: One Grumpy Old Competitor Speaks Up, and Catalyst’s senior applied research scientist, Jeremy Pickens, wrote an in-depth analysis, The Recommind Patent and the Need to Better Define ‘Predictive Coding’.

From elsewhere in the industry, other voices chimed in. Yesterday, Equivio distributed a statement to users of its software, saying that nothing in the Recommind patent “would inhibit, in any way, the use of Equivio software.” It goes on to say:

Recommind’s patent covers a very specific technique, within the predictive coding arena, for a very specific scenario. Recommind’s original request was in fact very broad, but the patent examiner rejected this request, and confined the patent to a particular threshold mechanism in a rolling loads scenario. Bottom line–other techniques for predictive coding are legitimate, and there are many different approaches available in the industry.

Indeed, at the time of Recommind’s filing, in May 2010, there were many vendors actively offering predictive coding applications in the e-discovery market. This was clear to anyone attending last year’s LegalTech New York conference in February 2010 or to anyone following the industrial and academic work at TREC 2009 and 2010. In their 2010 survey report on predictive coding vendors, the eDiscovery Institute lists 11 predictive coding providers. In addition to Equivio and Recommind, the companies surveyed included Capital Legal Solutions, Catalyst, FTI Technology, InterLegis, Kroll Ontrack, Valora and Xerox.

Equivio’s statement says that it has a number of pending patent applications on predictive coding with filing dates that pre-date the Recommind filing.

Another who questioned the patent was Venkat Rangan, founder and CTO of Clearwell. In a post at the blog e-discovery 2.0, Rangan squarely challenged the patent’s validity:

[W]e think the claims issued in the patent and the associated workflow are so commonly used that the workflow is neither novel nor non-obvious to a trained practitioner, and there is enough prior art on each of the individual technologies to warrant a re-examination and eventual invalidation of the patent. In any event, it is fairly easy for anyone to pick up existing prior art and devise a similar workflow that achieves the same or better outcome, and attempt to enforce the patent will likely be challenged.

Rangan takes it further, arguing that the patent is not just bad, but is bad for the corporations and law firms that use e-discovery technology.

[T]here is an even bigger issue at stake here beyond the status of Recommind’s patent: namely, shouldn’t the e-discovery vendor community continue to work, as it has for years, toward what is in the best interest of the legal community and, more broadly, the justice system? Recommind’s thinly veiled threats about requiring industry participants to license their technology are an affront to those who have invested years developing the technology and practicing the approach in real-world e-discovery cases. … Wouldn’t a better outcome be for corporations and law firms to benefit from the innovation that comes from free competition in the marketplace, while still honoring the sort of novel, non-obvious innovation that warrants patent protection?

Several others offered similar opinions questioning the validity of the patent. At her blog Ride the Lightning, Sharon Nelson, president of Sensei Enterprises, said, “I personally agree with John Tredennick that this technology has been decades in the making–it is likely to be challenged as not being novel and as being obvious.” Monica Bay, editor-in-chief of Law Technology News, wonders whether Recommind is blowing smoke. Although she usually steers clear of this sort of industry “pissing contest,” she says, she can’t help but believe that the Recommind patent “seems pretty darned broad and over-reaching.”  And Herbert L. Roitblat, CTO of OrcaTec, wrote at his Information Discovery blog, “Having examined the patent carefully, I can say that this patent covers only a very narrow method of computing in predictive coding and is unlikely to have any impact on the ability of any other eDiscovery service provider to continue to offer this game-changing capability.”

Regardless of whether the patent stands or falls, some industry observers see a silver lining in this brouhaha. Katey Wood, an analyst at Enterprise Strategy Group, writes that any patent battle could have the effect of promoting the wider use and acceptance of predictive coding among e-discovery professionals. “My hope is that a patent battle lends predictive coding more credibility in the legal market, and finally helps customers find religion where logical arguments haven’t succeeded,” she writes. And Barry Murphy, co-founder and principal analyst at eDiscoveryJournal, writes in a post there that this is, ultimately, all good for the industry. “One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what ‘predictive coding’ capabilities they do/do not have,” he says.

What goes around comes around. Today, the company that gave rise to this controversy responded to it. In a blog post, Recommind CEO Bob Tennant digs in his heels, calling the response of competitors “logical–if somewhat disingenuous.” He staunchly defends the viability of the patent, asserting that it “rests on a foundation a decade in the making and affords us the protection of law in the defense of our property rights.”

Predicting what the future holds for this predictive coding debate would require a crystal ball. To my knowledge, no one has yet patented one of those.

Predictive Coding: One Grumpy Old Competitor Speaks Up

Last week, Law Technology News reporter Evan Koblentz called me to ask about a new patent issued to Recommind for a method of “predictive coding.” At the time, I had only glanced at the patent and told the reporter that I was in no position to comment on its substantive claims over the phone. I did wish Recommind well with its patent and its business—just as I would with respect to other competitors.

As background, I also explained that getting a patent awarded was not the end of the process. Rather, to enforce the patent, one has to meet a number of additional challenges, including proving that the patented device or process was new and innovative. A patent based on works or ideas already in circulation, often referred to as “prior art,” is subject to challenge and revocation.

In response to further questions, I told the reporter that we were “puzzled” as to how a company could get a patent involving a process that had been around academia for more than 40 years. Before the call, I had spoken with Dr. Jeremy Pickens, our Senior Applied Research Scientist, to ask him for his thoughts on the patent. Prior to joining Catalyst, Jeremy’s research at the FX Palo Alto Lab led to six patents in the field of search and information retrieval, including two for collaborative exploratory search systems.

Jeremy had taken a quick look at the patent and wondered how it got through. You can read his comments about prior research and the state of the industry.

I had to laugh when, after the LTN article came out last week (Recommind Intends to Flex Predictive Coding Muscles), my comments were interpreted as “grumpy” by the good folks at Above the Law.

I found myself smiling because I have been called a lot of things but never grumpy. And, other than being an interested observer, I didn’t feel happy or unhappy about the Recommind patent. As I said to Evan Koblentz, I wish the Recommind people well with their patent and their business—they are doing a lot of exciting things in the industry and deserve their success.

But, for the record (as I used to say when I was a lawyer), the concepts and processes underlying predictive coding are not new. Perhaps Recommind has added a new wrinkle to the process but not much more than that, so far as we can see.

Who Invented ‘Predictive Coding’ Anyway?

The phrase “predictive coding” isn’t new in the industry and was not coined by Recommind. Even before Recommind filed the application for its patent, the Bank of America had already filed an application for a patent on “Predictive Coding of Documents in an Electronic Discovery System” (with the provisional application filed on March 27, 2009). Others have used the term for a variety of processes as well. For examples, just Google the phrase.

Although Recommind tried to bull its way through a trademark for the term, the effort failed. As Evan Koblentz later reported on the ALM blog EDD Update, the government rejected Recommind’s attempt to trademark a phrase that was descriptive and already in use by others. (Ironically, the same government agency that granted the patent rejected the trademark.)

Goodbye trademark.

More to the point, the techniques behind predictive coding aren’t new. As Dr. Pickens points out in his post, they go back to the 1970s when search scientists introduced the concept of “relevance feedback” into the lexicon. They realized the simple truth that computer-based search algorithms could be made more effective through an iterative process involving human feedback. And, that work has continued to evolve over the past 40 years.

So, we remain puzzled as to how Recommind could claim a patent around the work of so many others.

Ultimately, we are observers in this process because we use different math and techniques than Recommind and many of the others in the market. Specifically, we were one of the first to use a more modern set of algorithms, called non-negative matrix factorization, to analyze document themes and similarities.

This technique, developed at the Massachusetts Institute of Technology, is used widely for facial recognition as well as text analysis. We work closely with Dr. Michael Berry from the University of Tennessee’s Center for Intelligent Systems and Machine Learning, who is a leading proponent of the technique for mathematical search analysis.

Not Grumpy Here

So, we at Catalyst are certainly not grumpy—life’s good here. Nor are we unhappy about Recommind’s patent. Like other bystanders, we will enjoy watching Recommind try to enforce its patent against whomever they suspect might be using similar techniques. I am sure it will make for a great show, perhaps even worthy of truTV.

For the record, we didn’t invent predictive coding or the techniques around relevance feedback. Nor did Recommind. Check with Al Gore. Maybe he did.

 

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL).  This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm).  The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.

Hot on the list of topics, of course, was predictive coding.  Much of the discussion centered around determining exactly what standards were needed not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology may be safely applied to a matter at hand while still meeting all the legal requirements of discovery.

So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself.  Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process?  Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”

So I read the patent, and I read Recommind’s explanation, and I read the commentary, including Barry Murphy’s post, Dawn of the Predictive Coding Wars. First, from Murphy’s commentary:

According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on coattails.

Second, from the explanation:

Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …

The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan.  But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part.  Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.

OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology, so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow and documentation really ever explained. Great pain was taken to say what predictive coding was not (e.g. threading, clustering, etc. – which I agree with).   But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.

For that, I had to turn to the patent itself. See Figure 5 in the patent, labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim).  That claim says that the patent covers a method for analyzing a plurality of documents, comprising:

(1) Receiving the plurality of documents via a computing device

(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]

(3) Executing instructions stored in memory, that:

(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset

(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category

(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category

(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation

(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents

(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis

(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and

(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents

So that appears to be the primary workflow, the primary patented claim.  Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.

In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does.  It starts with a collection of documents. It selects a subset of those documents in some manner.  It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself so as to better align itself with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevance or nonrelevant, responsive or nonresponsive) for all the results.  Finally, it should be noted that the traditional, decades-old relevance feedback process workflow also is capable of iteration.
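For readers who have not seen relevance feedback in action, here is a minimal sketch of one iteration of the loop, using the classic Rocchio-style query update. This is one well-known way of implementing the idea and is an assumption on my part for illustration; it says nothing about how any vendor's product works. The toy documents, the raw term-count weighting and the `alpha`, `beta` and `gamma` values are likewise illustrative.

```python
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term counts; a stand-in for a real weighting such as tf-idf."""
    return Counter(text.lower().split())

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward documents marked relevant and away from nonrelevant ones."""
    revised = defaultdict(float)
    for term, weight in query.items():
        revised[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            revised[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            revised[term] -= gamma * weight / len(nonrelevant)
    # Terms whose weight ends up at or below zero are dropped.
    return {term: w for term, w in revised.items() if w > 0}

def score(query, doc):
    """Dot product between the query weights and the document's term counts."""
    return sum(w * doc.get(term, 0) for term, w in query.items())

# Toy collection (hypothetical).
docs = {
    "d1": "merger price negotiation with board",
    "d2": "fantasy football league price picks",
    "d3": "board approved the merger terms",
}
doc_vecs = {doc_id: vectorize(text) for doc_id, text in docs.items()}

# Step 1: the user issues a short query.
query = vectorize("price")

# Step 3: the user marks d1 relevant and d2 nonrelevant.
# Step 4: the system computes a better representation of the information need.
revised = rocchio(query, relevant=[doc_vecs["d1"]], nonrelevant=[doc_vecs["d2"]])

# Step 5: the system re-scores the collection with the revised query.
for doc_id, vec in doc_vecs.items():
    print(doc_id, score(query, vec), round(score(revised, vec), 2))
```

The point to notice is d3. It does not contain the original query term at all, so the first search scores it zero, yet after a single round of human feedback the revised query picks it up because it shares terms with the document the user marked relevant.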

So what is the difference? I don’t just ask this rhetorically. I see a very strong similarity in the overall workflows between both predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be–and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies, a goal in which I fully agree with Recommind–then we really need to understand the process as a whole and what makes it unique.

The only thing I can think of is that there are a few occasions in the claimed predictive coding workflow that integrate random sampling, and this is most likely to ensure that the process is defensible. If that is the case, then how does that differ from active learning? Here is an example of the active learning workflow which incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:

Input:
    seed labelled data L, unlabelled data U, batch size b

Initialization:
    Train a model on L

Active Learning Loop:
    Until a stopping criterion is satisfied:
        Apply the trained model classifier on U
        Rank the instances in U using the uncertainty of the model
        Annotate the top b instances and add them to L
        Train the model on the expanded L

That is, instead of just presenting the expert user (e.g. lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue they’ve been coded for, an active learning process or workflow explicitly seeks to add those document instances about which the learning algorithm is the most uncertain. That could mean documents for which the probability of that document’s label is relatively even or undistinguished (highest entropy) across all classes (in the case of generative machine learning models) or documents which lie the nearest to a decision boundary (in the case of discriminative machine learning models).
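To make the two notions of uncertainty concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the function names, the document ids and the model outputs are all hypothetical. The first function ranks documents by the entropy of their predicted label distribution; the second ranks them by distance from a decision boundary assumed to sit at a signed score of 0.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a class-probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rank_by_entropy(predictions, b):
    """Probabilistic case: return the b document ids whose predicted
    label distribution is closest to even (highest entropy first)."""
    ranked = sorted(predictions, key=lambda d: entropy(predictions[d]), reverse=True)
    return ranked[:b]

def rank_by_margin(scores, b):
    """Discriminative case: return the b document ids whose signed
    score lies nearest the decision boundary (assumed to be at 0)."""
    return sorted(scores, key=lambda d: abs(scores[d]))[:b]

# Hypothetical model outputs for four unlabelled documents.
predictions = {  # [P(responsive), P(nonresponsive)]
    "doc1": [0.95, 0.05],
    "doc2": [0.50, 0.50],
    "doc3": [0.10, 0.90],
    "doc4": [0.60, 0.40],
}
scores = {"doc1": 2.3, "doc2": -0.1, "doc3": -1.7, "doc4": 0.4}

print(rank_by_entropy(predictions, b=2))  # ['doc2', 'doc4']
print(rank_by_margin(scores, b=2))        # ['doc2', 'doc4']
```

In both cases the confidently labeled documents (doc1 and doc3) are passed over in favor of the ones the model cannot yet decide about.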

However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way–nay even the only way–of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of.  Thus, active learning as a general workflow pattern also incorporates random sampling.
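One way to picture this combination is a batch selector that spends most of its annotation budget on the most uncertain scored documents and reserves the rest for a random draw over everything else, including documents the model cannot score at all. The sketch below is only an assumption about how such a mix might look; `select_batch`, `explore_fraction` and the sample data are hypothetical names and values, not part of any published workflow.

```python
import random

def select_batch(scores, unlabelled_ids, b, explore_fraction=0.25, rng=None):
    """Pick b documents to annotate: the most uncertain scored documents
    first, then a random draw from all remaining unlabelled documents."""
    rng = rng or random.Random()
    n_random = max(1, int(b * explore_fraction))  # always explore a little
    n_uncertain = b - n_random

    # Documents the model can score, ordered by closeness to P = 0.5.
    scored = sorted(
        (d for d in unlabelled_ids if d in scores),
        key=lambda d: abs(scores[d] - 0.5),
    )
    batch = scored[:n_uncertain]

    # Random sampling reaches documents with no probability estimate yet.
    chosen = set(batch)
    remaining = [d for d in unlabelled_ids if d not in chosen]
    batch += rng.sample(remaining, min(n_random, len(remaining)))
    return batch

# Hypothetical data: "e" and "f" have no score because the model has
# not yet seen the signals needed to place them.
scores = {"a": 0.90, "b": 0.52, "c": 0.10, "d": 0.45}
unlabelled = ["a", "b", "c", "d", "e", "f"]
batch = select_batch(scores, unlabelled, b=4, rng=random.Random(0))
print(batch)
```

The fixed seed is there only to make the example repeatable; in practice the split between uncertain and random picks is a tuning decision.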

So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are.  That isn’t to say that they don’t exist.  However, I believe further discussion is warranted, both in public as well as at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole.  That is, I agree with Barry Murphy over at eDiscovery Journal that:

No matter what, this is good news for the eDiscovery market as a whole.  One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.

I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.
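As a rough illustration of that supervised pattern, the following sketch trains a small multinomial naive Bayes classifier on a labeled seed set and uses it to label an unseen document. Naive Bayes is simply a convenient stand-in chosen for this example, and the seed documents, labels and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Build word and label counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing so unseen word/label pairs are not zero.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical seed set labeled by a reviewer.
seed = [
    ("merger pricing agreement draft", "responsive"),
    ("pricing terms for the merger", "responsive"),
    ("lunch order pizza friday", "non-responsive"),
    ("friday softball game schedule", "non-responsive"),
]
model = train_naive_bayes(seed)
print(predict(model, "revised merger pricing"))  # responsive
print(predict(model, "pizza on friday"))         # non-responsive
```

A real review would use far more seed documents, and the predicted labels could drive review prioritization rather than being accepted as is.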

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.
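A very small sketch of this kind of label-free analysis, assuming a hypothetical list of emails with `from`, `to` and `body` fields, might simply count who emailed whom and which terms appear in the most messages. Real unsupervised methods (clustering, topic models, matrix factorization) go much further, but the input is the same: the collection itself, with no responsive or privileged labels.

```python
from collections import Counter

# Hypothetical, unlabeled email collection.
emails = [
    {"from": "alice", "to": ["bob"], "body": "merger pricing draft attached"},
    {"from": "alice", "to": ["bob", "carol"], "body": "updated merger pricing"},
    {"from": "bob", "to": ["carol"], "body": "softball game friday"},
]

# Who emailed whom, and how often?
pair_counts = Counter(
    (email["from"], recipient) for email in emails for recipient in email["to"]
)

# Which terms occur in the most messages (document frequency)?
term_counts = Counter(
    word for email in emails for word in set(email["body"].lower().split())
)

print(pair_counts.most_common(1))  # [(('alice', 'bob'), 2)]
print(term_counts["merger"])       # 2
```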

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.

Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Information retrieval is a discipline from the 1970s. Relational databases arrived in the 1960s. Most e-discovery platforms combine full text search (from information retrieval) and a relational database. What do you think is new and exciting in the world of e-discovery with tools that are 40 and 50 years old? Do you think there is a magic algorithm that will be used in e-discovery that will be as disruptive as Google PageRank was for broad Internet searching?

JEREMY PICKENS: There are a number of different angles from which one could approach this question. Recall from a previous blog post that one of the primary distinguishing factors between web search and e-discovery search is that the former is geared toward finding the one best answer, such as a factoid or a home page (precision-oriented), whereas the latter typically requires thousands if not millions of relevant (responsive) documents in order to satisfy an information need. This difference is not insignificant; it changes the entire nature of the search system being designed to meet that need.

Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, as expressed by “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchor text, that web page will be boosted in the rankings. That web page will be “voted” to the top.

More to the point: The specific algorithm that is used to count those votes is not as important as simply having the votes in the first place. Having the votes is what moves your page from page 57 of the results to page 1. A better algorithm might move the page to rank #2 on page 1, rather than rank #9 on page 1. But 90% of what got that document to page 1 was the votes themselves, rather than the mathematics of how the votes were counted. And simply being on page 1 accounts for 90% of the success of PageRank, as typical web searchers will only look at the first page of results and almost never further.

In summary, it is not so much the PageRank algorithm (mathematics) that makes PageRank so successful. It is the signal (link “votes”) used as input to the algorithm; the signal correlates well with the ultimate user goal.
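The point can be illustrated with a toy link graph. The sketch below compares a naive count of inbound links with the textbook power-iteration formulation of PageRank; it is not Google's production algorithm, and the graph, the damping factor of 0.85 and the iteration count are assumptions made for the example. It also assumes every link target appears as a key in the graph.

```python
def inlink_counts(links):
    """Naive vote counting: number of inbound links per page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts

def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical five-page site: page -> pages it links to.
links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "news": ["home"],
    "faq": ["home", "blog"],
}
counts = inlink_counts(links)
ranks = pagerank(links)
print(max(counts, key=counts.get))  # home
print(max(ranks, key=ranks.get))    # home
```

On this graph both methods put the same page on top, which is the argument in miniature: once the votes exist, the way they are counted mostly reshuffles positions near the top.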

So the question is whether there will ever be a magic algorithm for e-discovery that will be as disruptive as PageRank. This is the same as asking whether there will ever be a single signal (such as a link “vote”) that correlates well with the user goal or intention. At the risk of making too bold a claim, I think that the answer is no.


An e-discovery searcher’s information need simply does not fit the “magic bullet” profile. Someone engaged in e-discovery does not look at the first page of results and stop. That person (or a team of reviewers) may look at 20 pages. Or 100 pages. So whether one of the many available relevant documents is on page 1 or on page 57 matters much less. The user information need does not match what PageRank — or PageRank-like magic bullet algorithms — is trying to do.

Magic bullet algorithms try to get the single best result (or a small handful of results) to the very top of the list. E-discovery users need thousands or millions of relevant results. And when there is that much information, finding everything exhaustively is going to take a huge diversity of signals and coordination among dozens of algorithms.

Please note, however, that this does not mean algorithmic approaches will not work for e-discovery. Quite the contrary; e-discovery is in need of more, better and smarter algorithms. And these algorithms will improve our ability and capacity to meet the e-discovery challenge. It is just that the algorithms developed will not be “magic bullet” algorithms. They will be like a well-coordinated orchestra, with dozens of components playing together in unison.

(Image: Felipe Micaroni Lalli per Creative Commons.)

Automatic Footers: Toothless Legal Verbiage Causes Search Headaches

“The contents of this email may be privileged and confidential and are intended for the use of the intended addressee(s) only. Unless you are the addressee, you may not use, copy or disclose to anyone the message or any information contained in the message. Under penalty of death, public ridicule, and death a second time you are legally obligated to: 1) Delete this email and all copies. 2) Destroy your computer and email server using fire, sledge hammer, and/or atomic weapon. 3) Bury the remains of step two in a haunted pet cemetery. 4) Confess to a religious leader of your choosing that you read an unintended electronic communication and promise never to do it again.”

Anyone who has generated or received an email from a law firm or major corporation has become accustomed to the obligatory paragraph of legal jargon that follows even the most rudimentary of emails. You’ve seen how that added paragraph, when multiplied during the course of normal correspondence, takes what should be a well-formatted email reply and turns the whole string into an endless chain that even M.C. Escher would be proud of.

We put up with this added text in the belief that it is a necessary evil, one that could someday come to the rescue should we accidentally send a confidential email to Craigslist instead of Craig in accounting. Well, world, prepare to have that belief shattered. According to a recent article in The Economist, Spare us the e-mail yada-yada, these footers are probably pointless.

They are assumed to be a wise precaution. But they are mostly, legally speaking, pointless. Lawyers and experts on internet policy say no court case has ever turned on the presence or absence of such an automatic e-mail footer in America, the most litigious of rich countries.

Many disclaimers are, in effect, seeking to impose a contractual obligation unilaterally, and thus are probably unenforceable. This is clear in Europe, where a directive from the European Commission tells the courts to strike out any unreasonable contractual obligation on a consumer if he has not freely negotiated it. And a footer stating that nothing in the e-mail should be used to break the law would be of no protection to a lawyer or financial adviser sending a message that did suggest something illegal.

How effective can something be that is automatically generated and unilaterally imposed? An unintended recipient could just as easily argue that the notice does not actually represent any sort of subjective intent to claim privilege since it is being used on emails that range from afternoon pizza orders to communications with opposing counsel.

What Does This Have to Do with Search?

Despite their likely futility, these blocks of text are not going anywhere. Instead, those of us in e-discovery need to accept that they are going to be in our collections and find solutions for how to work around them.

While “filler” text has implications for many culling and review analytics, it is felt most in the identification of privileged documents. When every email in your collection contains the words “Privileged and Confidential” in the footer, a simple Boolean search is not going to cut it.

At Catalyst, we’ve approached this problem in a few different ways. The most effective is to temporarily alter the index in a way that excludes these likely “false positives.” In order for this to work, you need to devote some time to sampling your collection. Are there style encodings that designate footer text? Or is the footer text consistent across the collection so that it can be easily identified? Once you know what is to be removed, run your searches or other analytics, update the Potentially Privileged coding, and restore the original index.
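The idea can be sketched in a few lines of Python. This is not Catalyst's index-altering process, only a simplified illustration that assumes the footer is consistent across the collection and begins with a known phrase; `FOOTER_START`, `strip_footer`, `matches` and the sample emails are hypothetical. The original text is left untouched and the search runs against a footer-free copy.

```python
import re

# Assumed: sampling showed that every footer begins with this phrase.
FOOTER_START = "The contents of this email may be privileged and confidential"

def strip_footer(body, footer_start=FOOTER_START):
    """Return the body with everything from the footer onward removed."""
    position = body.find(footer_start)
    return body if position == -1 else body[:position].rstrip()

def matches(body, term):
    """Case-insensitive whole-word match, standing in for a Boolean search."""
    return re.search(r"\b" + re.escape(term) + r"\b", body, re.IGNORECASE) is not None

# Hypothetical collection: both emails carry the automatic footer.
emails = {
    "e1": "Pizza at noon?\n\n" + FOOTER_START + " and are intended for...",
    "e2": "This memo is privileged attorney work product.\n\n"
          + FOOTER_START + " and are intended for...",
}

raw_hits = [doc_id for doc_id, body in emails.items() if matches(body, "privileged")]
clean_hits = [
    doc_id for doc_id, body in emails.items()
    if matches(strip_footer(body), "privileged")
]
print(raw_hits)    # ['e1', 'e2']
print(clean_hits)  # ['e2']
```

With the footer in place the pizza order is a false positive; with it stripped, only the memo that actually discusses privilege is returned.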

A less text-intensive approach would be to focus on the metadata in your collection. Go beyond your standard privilege search and look at who the author and recipients are in your collection. Ninety percent of privilege comes down to who sent it and who received it. Focus on communications sent solely between privileged parties. Other documents can be further classified according to likelihood of being privileged. For long communications spanning varied recipients, use email threading tools to identify where privilege breaks or is created.
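A metadata-first triage along these lines could be sketched as follows. The tier names, the `privilege_tier` function and the list of privileged parties are all hypothetical; in practice the party list would come from counsel, and the tiers would only prioritize documents for review rather than decide privilege.

```python
def privilege_tier(sender, recipients, privileged_parties):
    """Classify a message by how many participants are privileged parties."""
    participants = {sender, *recipients}
    inside = participants & privileged_parties
    if inside == participants:
        return "likely privileged"    # sent solely between privileged parties
    if inside:
        return "needs review"         # privilege may have been broken
    return "unlikely privileged"

# Hypothetical list of attorneys and clients supplied by counsel.
privileged_parties = {"counsel@lawfirm.example", "ceo@client.example"}

print(privilege_tier("ceo@client.example", ["counsel@lawfirm.example"],
                     privileged_parties))  # likely privileged
print(privilege_tier("ceo@client.example",
                     ["counsel@lawfirm.example", "vendor@other.example"],
                     privileged_parties))  # needs review
print(privilege_tier("vendor@other.example", ["sales@other.example"],
                     privileged_parties))  # unlikely privileged
```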

With the right amount of planning and forethought when embarking on a document review strategy, even if these automated footers are useless, you won’t be.

Search Q&A: What E-Discovery Search Can Learn From Music Search

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]


BRUCE KIEFER: You’ve been involved with e-discovery for just over six months. You’ve attended LegalTech in New York in January, studied companies and talked to customers. How would you describe the state of information retrieval in e-discovery compared to your previous research in music and collaboration?

JEREMY PICKENS: Information retrieval is a decades-old field with many different types, techniques and algorithms supporting a wide variety of user information needs. Not all technologies are equally appropriate in every context; some are more appropriate to the web (navigational searches), some to music (recommendation engines), and some to e-discovery (recall-oriented informational searches).

E-discovery has borrowed and experimented with many classic information retrieval techniques, such as query suggestion via term co-occurrence analysis, query expansion via analysis of morphological variants (e.g. “stemming”), and clustering. I also find it encouraging that relevance feedback is making its way into e-discovery in the form of predictive and suggestive coding. The field’s core retrieval framework is Boolean, rather than more modern probabilistic and learning-to-rank approaches, which is unfortunate. Modern ranking algorithms are much more effective. The main impediment to moving beyond Boolean retrieval seems to be legal precedent rather than lack of technological prowess, so I am confident that this will continue to evolve.


There is one area, however, in which e-discovery could learn from work in music search and collaborative search. That area is the notion of information seeking as a multi-stage, session-based task. Most traditional information retrieval frameworks have been developed for ad hoc information needs, meaning that a query is issued and an answer given, at which point the interaction ends. The system treats subsequent queries as unrelated.

E-discovery is different, and is more in line with music search and collaborative search, in that the users’ information-seeking activity is ongoing and more than a single piece of information (e.g. document) is sought. Session-oriented thinking allows for a different approach to retrieval system design, and I expect to see an increase in awareness of these types of approaches.

Search Q&A: Introduction to a Series of Posts on the Science of Search

A good friend of mine, Miles Kehoe, co-author of the blog Enterprise Search, tells the story of his days at Verity when Google was emerging. Verity held a focus group to get human feedback from users about the quality of their search results. For some simple A-B testing, they also included feedback from customers using Google. It didn’t take long for Verity to realize that people really liked Google.

To confirm this observation, they changed the parameters of the test. They mimicked the minimalist style of Google’s result list and put the Google logo on top of the Verity results. Now, the focus group was divided in two: people looking at the Verity results with a Google logo and people looking at the Verity results with the Verity logo. The content was the same and the focus group still preferred the results from the page with the Google logo.


Brands are really powerful, as I learned again from the Verity-Google story. Google’s BackRub (as Larry Page and Sergey Brin first called their search engine) is a useful ranking system in the world of online content. With this algorithm, Google upset the status quo of the day. Companies like AlltheWeb, Lycos and AltaVista had different algorithms and quickly saw their market share erode. Yahoo acquired AlltheWeb and finally shut it down on April 4, 2011.

The world of search on the web and the world of search in e-discovery are different animals. The plumbing of search shares many components, whether your goal is to find out about the royal wedding, get a recommendation about a digital camera, or find a key custodian in your collection. The best way to use search in e-discovery depends on your goal—reverted queries, inverted indexes, term frequency, entropy and other tools are available. Why does one algorithm work well in one use case and not work in a different use case?

In this series of posts, prodded by questions from me, Catalyst Senior Applied Research Scientist Jeremy Pickens will discuss some of the nuances within e-discovery and search.

Now on to our first Q&A ….