Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Analytics

The Cloud Will Soon Overtake On-Premise Computing, Legal IT Survey Says

The future of legal technology is looking cloudy — and that’s not a bad thing. Cloud computing is on track to overtake on-premise computing within the legal services industry in the very near future, according to a recently published survey of legal IT professionals. Fifty-seven percent of those surveyed predicted that this will happen within five years and 81 percent said it will be within 10 years. Only 16 percent said it would never happen.

The survey was conducted in September by the publication Legal IT Professionals and its results were published Nov. 26. The online survey of the publication’s global readership elicited 438 responses, representing law firms ranging in size from small boutiques to global megafirms. More than three-quarters of respondents work directly in legal IT, either within a firm (54 percent) or as external consultants (24 percent). Lawyers and paralegals made up 22 percent of respondents.

The inevitability of the cloud overtaking on-premise computing is driven in part by the increasing prevalence of mobile devices within the legal industry, the survey found.

As connectivity – particularly mobile connectivity – becomes ubiquitous, and lawyers, like everyone else, become culturally accustomed to accessing everything online, cloud computing is likely to become the de facto delivery model for information and applications.

But the cloud also offers inherent advantages that are driving its ever-increasing popularity. “Cloud computing transcends geographical boundaries and storage limitations,” the survey noted. “It supports business continuity and disaster recovery.”

In fact, the survey’s respondents cited business continuity as among the top benefits of cloud computing. Asked what they considered to be the main benefits of the cloud, their top answers were:

  • Flexibility/Agility, 55%.
  • More mobility, 54%.
  • Business continuity, 52%.
  • Scalability, 47%.
  • Cost savings, 40%.
  • Ease of implementation, 21%.
  • Focus on core business, 18%.
  • Going green, 13%.

Although the survey identified a clear trend towards cloud computing, it also established that both legal professionals and clients maintain reservations. For example, respondents were asked, “If your law firm’s management asked for your advice regarding moving key applications to the cloud, would you be in favor of this strategy?” Responses were almost evenly split, with 45 percent in favor and 46 percent against moving key applications to the cloud. Smaller firms were more likely than larger firms to embrace a cloud strategy. “Law firms are notoriously risk averse and tend to be what one lawyer described as ‘proud second movers’ when it comes to technology,” the survey suggested.

In a similar vein, 60 percent of respondents believed that their clients might be concerned if key applications and services were hosted in the cloud. “The biggest concerns about this are among CIO/CTOs (67%) and general IT staff (68%), who are perhaps the most risk aware groups surveyed and have to deal directly with any security breach or outage,” the survey explained.

Shift in Attitude

Still, there is a general shift in attitude in favor of cloud computing, the survey found. More than half of respondents said they are more positive about it now than a year ago. Only 10 percent of respondents said that their opinion about cloud computing had become more negative.

As for the future, respondents overwhelmingly cited security and client confidentiality as the biggest challenges that they would have to address before moving IT resources to the cloud. Across all roles, firm sizes and locations, between 73 percent and 90 percent of respondents said that security was their top concern.

In the final analysis, the authors of the survey report conclude that the tide has turned for cloud computing and that the cloud is here to stay.

The tide has turned, particularly in the mid-markets which are facing competition from market entrants, large firms that are driven by market forces to price their services more competitively and specialist boutiques that are utilising cloud computing to access resources and offer services that drive competitive advantage. The smaller, more agile firms are leading the way in outsourcing their entire IT infrastructure to an external cloud provider.

You can download the complete Global Cloud Survey Report from the Legal IT Professionals home page or directly from this link. The full report contains additional questions and details about responses, along with selected quotes from respondents. The report includes an introduction written by Nicole Black, author of the ABA book, Cloud Computing for Lawyers, in which she offers her perspective on the results.

Best Practices in Predictive Coding: When are Pre-Culling and Keyword Searching Defensible?

Predictive coding is an effective e-discovery tool for ranking large sets of documents. However, it is commonly performed in a manner that may be severely under-inclusive–and therefore raise concerns about its defensibility.

In the use of predictive coding, it is a common practice for the producing party to run keyword searches first, and then sample and rank the resulting documents.  The documents that don’t hit on the searches are culled out before reaching the predictive coding process.

The reasons for doing it this way are:

  • Keyword searching is an accepted standard in e-discovery.
  • The client can avoid the per-document cost for the predictive coding software.
  • It reduces the number of documents that need to be reviewed and produced, which reduces time, cost, and risk.

However, this approach ignores the “dirty little secret” of e-discovery search—that keyword searches leave behind a large set of responsive/relevant documents.

Rank ALL (or Most) Documents, Not Just the Hits

The landmark e-discovery study about keyword searching was published in 1985 by Blair and Maron. David C. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 COMMUNC’NS. OF THE ACM 289, 295 (1985). In that study, the attorneys were confident that their searches had found more than 75% of the responsive documents.  But they were wrong.  In fact, the searches had only found 20% of the relevant documents.

This was the only study on the subject for many years. Keyword searching nevertheless became the accepted practice, largely because it was the best approach available.  More recently, however, studies by TREC and others have shown that Blair and Maron were right. TREC’s 2008 study found that keyword searching returned an average of just 24% of responsive documents. In its 2007 study, the result was 22%. Other studies found even lower rates.

While sampling the leave-behinds and iterative searching by adding terms discovered during the review can solve part of the problem, plenty of documents will be omitted from the review and production. 

This is the reason that going through the predictive coding process against ALL documents, rather than just the keyword search hits, is generally the most defensible practice.  Predictive coding based on sampling ALL the documents will find documents that the keyword searches miss.
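To make the arithmetic concrete, here is a minimal sketch in Python of the two numbers at issue: the recall of a keyword search, and a sample-based estimate of how many relevant documents sit among the non-hits. The function names and all of the document counts are hypothetical and chosen only for illustration; only the 20% recall figure comes from the Blair and Maron study cited above.

```python
def recall(found_relevant: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that the search actually retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return found_relevant / total_relevant


def estimate_missed(sample_size: int, sample_relevant: int, null_set_size: int) -> float:
    """Project how many relevant documents remain among the non-hits,
    based on a random sample drawn from those non-hits."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return null_set_size * sample_relevant / sample_size


# Hypothetical matter: 10,000 relevant documents exist; the keyword hits contain 2,000.
print(f"recall: {recall(2_000, 10_000):.0%}")  # prints "recall: 20%"

# Hypothetical sample of the leave-behinds: 500 reviewed, 25 relevant, 90,000 non-hits.
missed = estimate_missed(500, 25, 90_000)
print(f"estimated relevant documents left behind: {missed:.0f}")
```

A projection like this is only as good as the sample behind it, which is one more reason to let the ranking engine see the non-hits rather than rely on a single estimate.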

How to Use Keyword Searching with Predictive Coding

Does this mean that you don’t need keyword searching and other pre-culling techniques?  Of course not. Here are some of the ways we use keyword searching in conjunction with predictive coding:

  • Junk removal. We always analyze the documents for “junk” that can obviously be removed.  For example, in the Enron collection, it certainly makes sense to cull out the fantasy football documents first.  In a securities case, there are usually huge numbers of email market letters and irrelevant stock recommendations that can defensibly be removed.
  • Boosting “richness.” In most cases, the predictive coding software won’t work as well if the ratio of relevant documents to irrelevant documents (“richness percent”) is too low, meaning that the relevant documents are “sparse.”  It is legitimate and defensible to use keyword searching to boost the “richness” to a reasonable ratio before starting the predictive coding exercise.  Note, however, that it may make sense later to have the software rank the documents that didn’t hit on the keyword searches, as if it were a later rolling upload, to see if the software finds additional relevant documents.
  • Targeted searches. Often certain terms (or combinations of terms) will serve as a “rifle shot” to find important documents.  For example, in a patent case, a search for a technical term, such as a chemical name, may be important, especially when paired with the name of the inventor or the opposing party.  Targeted searching can be used at the beginning for sampling to find seed documents.  And, of course, it should be used throughout the review to find “rifle shot” documents and documents on sparsely populated issues that do not lend themselves to predictive coding. 
  • Metadata. Many predictive coding applications only analyze text and not metadata.  Obviously, there are many times when searching metadata is a key to finding relevant documents or filtering out irrelevant ones.  For example, in finding privileged communications, it helps to look in the TO and FROM fields to see if there are attorneys and clients in them.  Similarly, date filters are critical in filtering out documents that may contain “relevant” terms but are irrelevant to the issues in the case because of the time frame. 
  • Sampling and discrepancy analysis. While predictive coding applications typically include sampling methodologies, it is nevertheless a good idea to do additional sampling outside of the application, which can be done by searching for documents likely to be relevant. In particular, the discrepancy analysis, which compares software predictions with actual coding, will find documents that were given a low rank by the software.  Once that happens, you can go search for similar documents using “More Like This” and keyword searching. 
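
Two of the ideas in the list above, richness and discrepancy analysis, are simple enough to sketch in a few lines of Python. This is an illustration only, not how any particular predictive coding product works: the document IDs, the scores, the 0.5 cutoff and the function names are all assumptions made up for the example.

```python
from typing import Dict, List


def richness(coded_sample: Dict[str, bool]) -> float:
    """Share of a reviewed random sample that reviewers coded as relevant."""
    if not coded_sample:
        raise ValueError("sample is empty")
    return sum(coded_sample.values()) / len(coded_sample)


def discrepancies(
    scores: Dict[str, float], coding: Dict[str, bool], cutoff: float = 0.5
) -> List[str]:
    """Documents that reviewers coded relevant but the software ranked below the cutoff."""
    return sorted(
        doc_id
        for doc_id, is_relevant in coding.items()
        if is_relevant and scores.get(doc_id, 0.0) < cutoff
    )


# Hypothetical software scores and reviewer coding for four documents.
software_scores = {"D1": 0.91, "D2": 0.12, "D3": 0.40, "D4": 0.05}
reviewer_coding = {"D1": True, "D2": True, "D3": False, "D4": False}

print(f"richness of sample: {richness(reviewer_coding):.0%}")
print("low-ranked but relevant:", discrepancies(software_scores, reviewer_coding))
```

Each document the second function returns is a natural starting point for a “More Like This” or keyword search.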

The Bottom Line

If you run predictive coding with ALL the documents, and not just those that hit on keyword searches, you will find more relevant documents, so the process is more defensible.  But keyword searching and searching metadata are still critical tools in the e-discovery toolbox.

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.
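
As a rough illustration of the idea, and not a description of any actual collaborative search system, the Python sketch below ranks documents by how well they satisfy every team member at once while skipping anything the team has already judged. The member names, the scores and the choice of `min()` as the combining rule are all assumptions made for the example.

```python
from typing import Dict, List, Set


def joint_ranking(
    member_scores: Dict[str, Dict[str, float]],
    already_judged: Set[str],
) -> List[str]:
    """Rank unjudged documents by the weakest score any team member gives them."""
    if not member_scores:
        return []
    # Only documents that every member has scored can satisfy the joint need.
    shared = set.intersection(*(set(s) for s in member_scores.values()))
    candidates = shared - already_judged
    # min() means a document must satisfy every member, not just one of them.
    return sorted(
        candidates,
        key=lambda doc: (-min(s[doc] for s in member_scores.values()), doc),
    )


# Hypothetical per-member relevance scores for four documents.
team_scores = {
    "lead_counsel": {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.8},
    "review_manager": {"A": 0.8, "B": 0.9, "C": 0.6, "D": 0.1},
}
judged = {"A"}  # already reviewed, so there is no value in surfacing it again

print(joint_ranking(team_scores, judged))
```

Document C comes first here because it is the only unjudged document that both members score reasonably well, which is the house with both the southern exposure and the motorcycle workshop.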

The Recommind Patent: Reactions Roll In From Across the Industry

After Recommind announced June 8 that it had obtained a patent on predictive coding, the news rapidly rebounded throughout the e-discovery industry. In a Law Technology News article published the same day, Recommind Intends to Flex Predictive Coding Muscles, reporter Evan Koblentz quoted Craig Carpenter, Recommind’s general counsel and vice president of marketing, as saying that the company would “seek to license the patents to other companies that already offer their own versions of predictive coding or that want to have the ability.”

Koblentz also spoke to Catalyst’s CEO, John Tredennick, who said, “We’re puzzled that you can get a patent on what seems to be 40 years in the making in the academic community.” The next day, in a post at the blog Above the Law, John’s response and those of other Recommind competitors were characterized as jealous and grumpy.

Call it grumpiness if you will, but others in the e-discovery industry continue to weigh in on the patent with comments that are every bit as skeptical. Here at the Catalyst blog, Tredennick wrote a more-detailed explanation of his position, Predictive Coding: One Grumpy Old Competitor Speaks Up, and Catalyst’s senior applied research scientist, Jeremy Pickens, wrote an in-depth analysis, The Recommind Patent and the Need to Better Define ‘Predictive Coding’.

From elsewhere in the industry, other voices chimed in. Yesterday, Equivio distributed a statement to users of its software, saying that nothing in the Recommind patent “would inhibit, in any way, the use of Equivio software.” It goes on to say:

Recommind’s patent covers a very specific technique, within the predictive coding arena, for a very specific scenario. Recommind’s original request was in fact very broad, but the patent examiner rejected this request, and confined the patent to a particular threshold mechanism in a rolling loads scenario. Bottom line–other techniques for predictive coding are legitimate, and there are many different approaches available in the industry.

Indeed, at the time of Recommind’s filing, in May 2010, there were many vendors actively offering predictive coding applications in the e-discovery market. This was clear to anyone attending last year’s LegalTech New York conference in February 2010 or to anyone following the industrial and academic work at TREC 2009 and 2010. In their 2010 survey report on predictive coding vendors, the eDiscovery Institute lists 11 predictive coding providers. In addition to Equivio and Recommind, the companies surveyed included Capital Legal Solutions, Catalyst, FTI Technology, InterLegis, Kroll Ontrack, Valora and Xerox.

Equivio’s statement says that it has a number of pending patent applications on predictive coding with filing dates that pre-date the Recommind filing.

Another who questioned the patent was Venkat Rangan, founder and CTO of Clearwell. In a post at the blog e-discovery 2.0, Rangan squarely challenged the patent’s validity:

[W]e think the claims issued in the patent and the associated workflow are so commonly used that the workflow is neither novel nor non-obvious to a trained practitioner, and there is enough prior art on each of the individual technologies to warrant a re-examination and eventual invalidation of the patent. In any event, it is fairly easy for anyone to pick up existing prior art and devise a similar workflow that achieves the same or better outcome, and attempt to enforce the patent will likely be challenged.

Rangan takes it further, arguing that the patent is not just bad, but is bad for the corporations and law firms that use e-discovery technology.

[T]here is an even bigger issue at stake here beyond the status of Recommind’s patent: namely, shouldn’t the e-discovery vendor community continue to work, as it has for years, toward what is in the best interest of the legal community and, more broadly, the justice system? Recommind’s thinly veiled threats about requiring industry participants to license their technology are an affront to those who have invested years developing the technology and practicing the approach in real-world e-discovery cases. … Wouldn’t a better outcome be for corporations and law firms to benefit from the innovation that comes from free competition in the marketplace, while still honoring the sort of novel, non-obvious innovation that warrants patent protection?

Several others offered similar opinions questioning the validity of the patent. At her blog Ride the Lightning, Sharon Nelson, president of Sensei Enterprises, said, “I personally agree with John Tredennick that this technology has been decades in the making–it is likely to be challenged as not being novel and as being obvious.” Monica Bay, editor-in-chief of Law Technology News, wonders whether Recommind is blowing smoke. Although she usually steers clear of this sort of industry “pissing contest,” she says, she can’t help but believe that the Recommind patent “seems pretty darned broad and over-reaching.”  And Herbert L. Roitblat, CTO of OrcaTec, wrote at his Information Discovery blog, “Having examined the patent carefully, I can say that this patent covers only a very narrow method of computing in predictive coding and is unlikely to have any impact on the ability of any other eDiscovery service provider to continue to offer this game-changing capability.”

Regardless of whether the patent stands or falls, some industry observers see a silver lining in this brouhaha. Katey Wood, an analyst at Enterprise Strategy Group, writes that any patent battle could have the effect of promoting the wider use and acceptance of predictive coding among e-discovery professionals. “My hope is that a patent battle lends predictive coding more credibility in the legal market, and finally helps customers find religion where logical arguments haven’t succeeded,” she writes. And Barry Murphy, co-founder and principal analyst at eDiscoveryJournal, writes in a post there that this is, ultimately, all good for the industry. “One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what ‘predictive coding’ capabilities they do/do not have,” he says.

What goes around comes around. Today, the company that gave rise to this controversy responded to it. In a blog post, Recommind CEO Bob Tennant digs in his heels, calling the response of competitors “logical–if somewhat disingenuous.” He staunchly defends the viability of the patent, asserting that it “rests on a foundation a decade in the making and affords us the protection of law in the defense of our property rights.”

Predicting what the future holds for this predictive coding debate would require a crystal ball. To my knowledge, no one has yet patented one of those.

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL).  This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm).  The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.

Hot on the list of topics, of course, was predictive coding.  Much of the discussion centered around determining exactly what standards were needed not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology may be safely applied to a matter at hand while still meeting all the legal requirements of discovery.

So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself.  Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process?  Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”

So I read the patent, and I read Recommind’s explanation, and I read the commentary, including Barry Murphy’s post, Dawn of the Predictive Coding Wars. First, from Murphy’s commentary:

According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on coattails.

Second, from the explanation:

Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …

The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan.  But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part.  Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.

OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology, so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow and documentation really ever explained. Great pain was taken to say what predictive coding was not (e.g. threading, clustering, etc. – which I agree with).   But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.

For that, I had to turn to the patent itself. See Figure 5 in the patent, labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim).  That claim says that the patent covers a method for analyzing a plurality of documents, comprising:

(1) Receiving the plurality of documents via a computing device

(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]

(3) Executing instructions stored in memory, that:

(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset

(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category

(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category

(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation

(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents

(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis

(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and

(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents

So that appears to be the primary workflow, the primary patented claim.  Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.

In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does.  It starts with a collection of documents. It selects a subset of those documents in some manner.  It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself so as to better align itself with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevant or nonrelevant, responsive or nonresponsive) for all the results.  Finally, it should be noted that the traditional, decades-old relevance feedback process workflow also is capable of iteration.
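
For readers who want to see what one such feedback iteration can look like in practice, here is a minimal Python sketch of the Rocchio update, one classic way of computing “a better representation of the information need.” It is offered as an illustration of relevance feedback in general, not as a description of any vendor’s product or of the patented method. The term-weight dictionaries, the example documents and the helper names are assumptions, and the alpha, beta and gamma defaults are simply commonly used textbook values.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # term -> weight


def centroid(vectors: List[Vector]) -> Vector:
    """Average term weights across a list of document vectors."""
    if not vectors:
        return {}
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(
    query: Vector,
    relevant: List[Vector],
    nonrelevant: List[Vector],
    alpha: float = 1.0,
    beta: float = 0.75,
    gamma: float = 0.15,
) -> Vector:
    """Move the query toward documents marked relevant and away from nonrelevant ones."""
    updated: Dict[str, float] = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for term, weight in centroid(relevant).items():
        updated[term] += beta * weight
    for term, weight in centroid(nonrelevant).items():
        updated[term] -= gamma * weight
    # Terms whose weight is not positive are dropped from the revised query.
    return {term: weight for term, weight in updated.items() if weight > 0}


def score(query: Vector, doc: Vector) -> float:
    """Dot product of query and document weights, used to re-rank results."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


# Hypothetical feedback round: one document marked relevant, one nonrelevant.
original = {"merger": 1.0}
revised = rocchio(
    original,
    relevant=[{"merger": 1.0, "acquisition": 1.0}],
    nonrelevant=[{"merger": 1.0, "football": 1.0}],
)
unseen_doc = {"acquisition": 1.0}
print(score(original, unseen_doc), score(revised, unseen_doc))
```

The unseen document scores zero against the original query but picks up a positive score after feedback, because the reviewer’s labels taught the query a new term.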

So what is the difference? I don’t just ask this rhetorically. I see a very strong similarity in the overall workflows between both predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be–and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies, a goal in which I fully agree with Recommind–then we really need to understand the process as a whole and what makes it unique.

The only thing I can think of is that there are a few occasions in the claimed predictive coding workflow that integrate random sampling and this is most likely to ensure that the process is defensible. If that is the case, then how does that differ from active learning? Here is an example of the active learning workflow which incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:

Input:
    seed labelled data L, unlabelled data U,
    batch size b

Initialization:
    Train a model on L

Active Learning Loop:
    Until a stopping criterion is satisfied:
        Apply the trained model classifier on U
        Rank the instances in U using the uncertainty of the model
        Annotate the top b instances and add them to L
        Train the model on the expanded L

That is, instead of just presenting the expert user (e.g. lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue they’ve been coded for, an active learning process or workflow explicitly seeks to add those document instances about which the learning algorithm is the most uncertain. That could mean documents for which the probability of that document’s label is relatively even or undistinguished (highest entropy) across all classes (in the case of generative machine learning models) or documents which lie the nearest to a decision boundary (in the case of discriminative machine learning models).

However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way–nay even the only way–of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of.  Thus, active learning as a general workflow pattern also incorporates random sampling.
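To make the selection step concrete, here is a minimal sketch of how one batch might be chosen, mixing entropy-based uncertainty sampling with a random draw. It is an illustration only, not Recommind's workflow or the method from the Vlachos paper. The function names, the `random_fraction` parameter and the input format (a dict of document id to class probabilities) are all assumptions made for this example.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_batch(predictions, batch_size, random_fraction=0.25, rng=None):
    """Pick document ids to send to the reviewer for labeling.

    predictions: dict of doc_id -> list of class probabilities produced by
        the current model (hypothetical input format).
    random_fraction: share of the batch drawn uniformly at random, so that
        documents the model has no useful signal for can still surface.
    """
    rng = rng or random.Random()
    n_random = int(batch_size * random_fraction)
    n_uncertain = batch_size - n_random

    # Rank by uncertainty: highest entropy first.
    ranked = sorted(predictions, key=lambda d: entropy(predictions[d]), reverse=True)
    uncertain = ranked[:n_uncertain]

    # Fill the rest of the batch with a random draw from what remains.
    remaining = ranked[n_uncertain:]
    randomly_chosen = rng.sample(remaining, min(n_random, len(remaining)))
    return uncertain + randomly_chosen


# Toy two-class predictions (made-up numbers).
predictions = {
    "doc1": [0.98, 0.02],
    "doc2": [0.55, 0.45],
    "doc3": [0.50, 0.50],
    "doc4": [0.90, 0.10],
    "doc5": [0.70, 0.30],
}
batch = select_batch(predictions, batch_size=4, random_fraction=0.5,
                     rng=random.Random(0))
```

In this toy run the first two slots go to the documents the model is least sure about (`doc3`, then `doc2`), and the other two are drawn at random from the rest. How large the random share should be is a judgment call that this sketch does not try to answer.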

So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are. That isn’t to say that they don’t exist. However, I believe further discussion is warranted, both in public and at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole. That is, I agree with Barry Murphy over at eDiscovery Journal that:

No matter what, this is good news for the eDiscovery market as a whole.  One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.

I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.
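As a rough illustration of the supervised case, the sketch below trains a tiny naive Bayes classifier on a handful of reviewer-coded documents and uses it to prioritize unseen ones. Naive Bayes is simply a convenient choice for a short example, not a statement about what any e-discovery product uses. The function names and the sample documents are invented.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs coded by a reviewer."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def log_score(text, label, model):
    """Log of prior times likelihood, with add-one (Laplace) smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for word in text.lower().split():
        if word in vocab:  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
    return score


def prob_responsive(text, model):
    """Posterior probability of the 'responsive' label for one document."""
    r = log_score(text, "responsive", model)
    n = log_score(text, "non-responsive", model)
    return 1 / (1 + math.exp(n - r))


# Ground truth for a small seed set (invented examples).
seed = [
    ("merger pricing agreement draft", "responsive"),
    ("pricing agreement with competitor", "responsive"),
    ("lunch order pizza friday", "non-responsive"),
    ("office party friday schedule", "non-responsive"),
]
model = train(seed)

# Unseen documents, ranked so likely-responsive ones are reviewed first.
unseen = {
    "a": "revised pricing agreement",
    "b": "pizza party friday",
    "c": "quarterly schedule draft",
}
ranked = sorted(unseen, key=lambda d: prob_responsive(unseen[d], model),
                reverse=True)
```

The resulting scores can be accepted as labels by applying a threshold, or, as here, used only to order the review queue.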

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.
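A very small sketch of this kind of label-free analysis is shown below. It simply counts who emailed whom and which subject-line terms occur most often. Raw term frequency is only a crude stand-in for real topic extraction (such as the matrix factorization mentioned in the question), and the message format is an assumption made for the example.

```python
from collections import Counter

# Invented messages; no responsive/privileged labels are needed.
emails = [
    {"from": "alice", "to": ["bob"], "subject": "pricing agreement draft"},
    {"from": "alice", "to": ["bob", "carol"], "subject": "revised pricing agreement"},
    {"from": "bob", "to": ["alice"], "subject": "pricing question"},
    {"from": "carol", "to": ["dave"], "subject": "friday lunch"},
]

# Who emailed whom: one count per (sender, recipient) pair.
pair_counts = Counter(
    (msg["from"], recipient) for msg in emails for recipient in msg["to"]
)

# Most frequent subject terms, as a rough proxy for topics.
term_counts = Counter(
    word for msg in emails for word in msg["subject"].lower().split()
)

top_pair = pair_counts.most_common(1)
top_terms = term_counts.most_common(2)
```

Patterns like these do not answer the legal question by themselves, but they can suggest which custodians and terms are worth a closer look.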

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.

Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Information retrieval is a discipline from the 1970s. Relational databases arrived in the 1960s. Most e-discovery platforms combine full text search (from information retrieval) and a relational database. What do you think is new and exciting in the world of e-discovery with tools that are 40 and 50 years old? Do you think there is a magic algorithm that will be used in e-discovery that will be as disruptive as Google PageRank was for broad Internet searching?

JEREMY PICKENS: There are a number of different angles from which one could approach this question. Recall from a previous blog post that one of the primary distinguishing factors between web search and e-discovery search is that the former is geared toward finding the one best answer, such as a factoid or a home page (precision-oriented), whereas the latter typically requires thousands if not millions of relevant (responsive) documents in order to satisfy an information need. This difference is not insignificant; it changes the entire nature of the search system being designed to meet that need.

Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, e.g. “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchor text, that web page will be boosted in the rankings. That web page will be “voted” to the top.

More to the point: The specific algorithm that is used to count those votes is not as important as simply having the votes in the first place. Having the votes is what moves your page from page 57 of the results to page 1. A better algorithm might move the page to rank #2 on page 1, rather than rank #9 on page 1. But 90% of what got that document to page 1 was the votes themselves, rather than the mathematics of how the votes were counted. And simply being on page 1 accounts for 90% of the success of PageRank, as typical web searchers will only look at the first page of results and almost never further.

In summary, it is not so much the PageRank algorithm (mathematics) that makes PageRank so successful. It is the signal (link “votes”) used as input to the algorithm; the signal correlates well with the ultimate user goal.
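The point about signal versus mathematics can be illustrated with a toy link graph. The sketch below computes a textbook power-iteration PageRank and, for comparison, a plain count of incoming links. It is a simplified illustration, not Google's production algorithm. The graph, the function names and the damping value of 0.85 are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


def inlink_counts(links):
    """The raw signal: how many pages link to each page."""
    counts = {}
    for page, targets in links.items():
        counts.setdefault(page, 0)
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts


# Toy web: most pages "vote" for the page called hub.
links = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "d": ["hub"],
    "hub": ["a"],
}
rank = pagerank(links)
votes = inlink_counts(links)
by_rank = sorted(rank, key=rank.get, reverse=True)
by_votes = sorted(votes, key=votes.get, reverse=True)
```

On this toy graph both methods put `hub` first and `a` second. The more careful mathematics refines the scores, but the ordering at the top is already determined by the votes.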

So the question is whether there will ever be a magic algorithm for e-discovery that will be as disruptive as PageRank. This is the same as asking whether there will ever be a single signal (such as a link “vote”) that correlates well with the user goal or intention. At the risk of making too bold of a claim, I think that the answer is no.

An e-discovery searcher’s information need simply does not fit the “magic bullet” profile. Someone engaged in e-discovery does not look at the first page of results and stop. That person (or a team of reviewers) may look at 20 pages. Or 100 pages. So whether one of the many available relevant documents is on page 1 or on page 57 matters much less. The user information need does not match what PageRank — or PageRank-like magic bullet algorithms — is trying to do.

Magic bullet algorithms try to get the absolute single best result (or a small handful of results) to the very top of the list. E-discovery users need thousands or millions of relevant results. And when there is that much information, there is going to be a huge diversity of signals and coordination between dozens of various algorithms to exhaustively find everything.

Please note, however, that this does not mean algorithmic approaches will not work for e-discovery. Quite the contrary; e-discovery is in need of more, better and smarter algorithms. And these algorithms will improve our ability and capacity to meet the e-discovery challenge. It is just that the algorithms developed will not be “magic bullet” algorithms. They will be like a well-coordinated orchestra, with dozens of components playing together in unison.

(Image: Felipe Micaroni Lalli per Creative Commons.)

Automatic Footers: Toothless Legal Verbiage Causes Search Headaches

“The contents of this email may be privileged and confidential and are intended for the use of the intended addressee(s) only. Unless you are the addressee, you may not use, copy or disclose to anyone the message or any information contained in the message. Under penalty of death, public ridicule, and death a second time you are legally obligated to: 1) Delete this email and all copies. 2) Destroy your computer and email server using fire, sledge hammer, and/or atomic weapon. 3) Bury the remains of step two in a haunted pet cemetery. 4) Confess to a religious leader of your choosing that you read an unintended electronic communication and promise never to do it again.”

If you have ever sent or received an email from a law firm or major corporation, you’ve become accustomed to the obligatory paragraph of legal jargon that follows even the most rudimentary of emails. You’ve seen how that added paragraph, when multiplied during the course of normal correspondence, takes what should be a well-formatted email reply and turns the whole string into an endless chain that even M.C. Escher would be proud of.

This added text is something we put up with, believing it is a necessary evil. We are under the belief that this paragraph could someday come to the rescue should we accidentally send a confidential email to Craigslist instead of Craig in accounting. Well, world, prepare to have that belief shattered. According to a recent article in The Economist, Spare us the e-mail yada-yada, these footers are probably pointless.

They are assumed to be a wise precaution. But they are mostly, legally speaking, pointless. Lawyers and experts on internet policy say no court case has ever turned on the presence or absence of such an automatic e-mail footer in America, the most litigious of rich countries.

Many disclaimers are, in effect, seeking to impose a contractual obligation unilaterally, and thus are probably unenforceable. This is clear in Europe, where a directive from the European Commission tells the courts to strike out any unreasonable contractual obligation on a consumer if he has not freely negotiated it. And a footer stating that nothing in the e-mail should be used to break the law would be of no protection to a lawyer or financial adviser sending a message that did suggest something illegal.

How effective can something be that is automatically generated and unilaterally imposed? An unintended recipient could just as easily argue that the notice does not actually represent any sort of subjective intent to claim privilege since it is being used on emails that range from afternoon pizza orders to communications with opposing counsel.

What Does This Have to Do with Search?

Despite their likely futility, these blocks of text are not going anywhere. Instead, those of us in e-discovery need to accept that they are going to be in our collections and find solutions for how to work around them.

While “filler” text has implications for many culling and review analytics, it is felt most in the identification of privileged documents. When every email in your collection contains the words “Privileged and Confidential” in the footer, a simple Boolean search is not going to cut it.

At Catalyst, we’ve approached this problem in a few different ways. The most effective is to temporarily alter the index in a way that excludes these likely “false positives.” In order for this to work, you need to devote some time to sampling your collection. Are there style encodings that designate footer text? Or is the footer text consistent across the collection so that it can be easily identified? Once you know what is to be removed, run your searches or other analytics, update the documents’ Potentially Privileged coding and restore the original index.
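As a simplified illustration of the idea, the sketch below strips a known footer from each message before running a keyword search for “privileged.” It works on raw text rather than on a search index, so it is a stand-in for the approach rather than a description of how Catalyst alters its index. The footer string, the function names and the sample documents are assumptions made for the example.

```python
import re

# A footer identified by sampling the collection (hypothetical example).
FOOTER = (
    "The contents of this email may be privileged and confidential and are "
    "intended for the use of the intended addressee(s) only."
)


def strip_footer(body, footer=FOOTER):
    """Remove every occurrence of a known footer from a message body."""
    # Build a regex from the footer that tolerates line wraps and extra
    # spaces between words; matching is case-insensitive.
    pattern = r"\s+".join(re.escape(word) for word in footer.split())
    return re.sub(pattern, " ", body, flags=re.IGNORECASE)


def privilege_hits(docs, term="privileged"):
    """Return ids of documents that mention the term outside the footer."""
    return [
        doc_id
        for doc_id, body in docs.items()
        if term in strip_footer(body).lower()
    ]


docs = {
    "pizza": "Two large pepperoni for noon please.\n\n" + FOOTER,
    "advice": (
        "This memo is privileged attorney-client advice.\n\n"
        "The contents of this email may be\nprivileged and confidential and "
        "are intended for the use of the intended addressee(s) only."
    ),
}
hits = privilege_hits(docs)
```

Here only the `advice` message is returned, even though both messages carry the footer. Footers that vary in wording across the collection would need more than one pattern, which is why the sampling step matters.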

A less text-intensive approach would be to focus on the metadata in your collection. Go beyond your standard privilege search and look at who the author and recipients are in your collection. Ninety percent of privilege comes down to who sent it and who received it. Focus on communications sent solely between privileged parties. Other documents can be further classified according to likelihood of being privileged. For long communications spanning varied recipients, use email threading tools to identify where privilege breaks or is created.
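A minimal sketch of this metadata-first triage might look like the following. The list of privileged parties, the three bucket names and the function name are all hypothetical. A real matter would need a vetted attorney and client list and would still call for human review of the results.

```python
# Hypothetical list of parties whose communications may be privileged.
PRIVILEGED_PARTIES = {"counsel@lawfirm.example", "gc@client.example"}


def privilege_likelihood(sender, recipients, privileged=PRIVILEGED_PARTIES):
    """Crude three-way bucket based only on who sent and received a message."""
    parties = {sender, *recipients}
    if parties <= privileged:
        return "high"    # sent solely between privileged parties
    if parties & privileged:
        return "medium"  # a privileged party is involved, but so are others
    return "low"         # no privileged party on the message


solely_privileged = privilege_likelihood(
    "counsel@lawfirm.example", ["gc@client.example"]
)
mixed = privilege_likelihood(
    "counsel@lawfirm.example", ["gc@client.example", "vendor@other.example"]
)
unrelated = privilege_likelihood("sales@client.example", ["vendor@other.example"])
```

The "medium" bucket is where email threading helps most, since a third party added partway through a thread is one place where privilege can break.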

With the right amount of planning and forethought when embarking on a document review strategy, even if these automated footers are useless, you won’t be.

‘Search’ is the Word

We’ve added something new to the Catalyst blog. You may have noticed it: a new word in the blog’s name. The word is “search.” Adding a single word may seem like a subtle change, but we see it as momentous. Let me explain.

We’ve been blogging for more than a year and watched as our readership has grown both in the United States and around the world. We’ve made new friends and had opportunities to give shout-outs to our favorite people and writers elsewhere as well. We have been flattered to have received the attention of revered sources such as the Wall Street Journal, the ABA Journal, Forbes, Law.com, Above the Law, and many of the top bloggers in our space.

Recently, I had a chance to meet with Bob Ambrogi, our communications director. Preparing for that meeting, I got to thinking about our purpose in writing what we called the Catalyst E-Discovery Blog. Something didn’t seem right about our focus.

While e-discovery is an important topic for trial lawyers and legal professionals, it covers a lot of ground. With our small band of merry writers, having day jobs to boot, we couldn’t hope to address that broad waterfront. Nor would we want to. Many others are already doing a good job of covering the growing e-discovery space.

Then it hit me, probably that morning in the shower where most of the lightning bolts strike. “We need to focus on search,” I said to myself. Search is at the heart of the e-discovery process and search is what gets us up to go to work. After all, Catalyst is primarily a search company. Just about everything we develop is built around search and what we harvest from search.

So, search is the word, at least from our perspective. With digital content growing like Topsy, legal teams don’t have a prayer of reviewing it all. Shoot, you can’t hope to review even a small percentage of what people are collecting these days. Without search, we would be in a world of hurt, at least for e-discovery.

Not only is search the word, it is also the law. I still recall reading the decisions in U.S. v. O’Keefe and Victor Stanley v. Creative Pipe Inc. U.S. Magistrate Judges John M. Facciola and Paul W. Grimm were attempting to send a radical message, one that sure caught my attention. “Search matters,” they said, backing that up with their rulings. If you don’t do search right, you are going to miss something important. If what you miss happens to be privileged material or something you should have produced to the other side, there will be consequences. Privilege may be revoked. Sanctions may be issued. Pay attention people!

I did. Shortly after reading those decisions, I started talking about how we at Catalyst might up our game. For more than a decade, we have provided one of the most powerful search engines in the industry. Still, we assumed our attorney users would do the heavy lifting. Not so, said Magistrates Facciola and Grimm. Search “is clearly beyond the ken of a layman,” not to mention of lawyers and judges, Magistrate Facciola cautioned in O’Keefe. What do they know about search? For that matter, what do any of us know other than what we learned using Lexis and Westlaw?

So, I rounded up some of the smartest people I know and formed the Catalyst Search and Analytics Consulting group. Their mission was to focus on honing search skills and to help our partners and clients get better at what they were doing. Best practices, tips for making privilege searches more effective and sampling techniques will all be featured in this blog.

We also brought in serious scientists with backgrounds in statistics, mathematics and deep-text mining. On top of that, we got involved with leading search think tanks, like The Center for Intelligent Systems and Machine Learning (CISML) at the University of Tennessee, Knoxville. Bruce Kiefer, our director of research and development, and I even presented the keynote on advanced mathematical and statistical search techniques at Text Mining 2010, the annual workshop held in conjunction with the 2010 SIAM International Conference on Data Mining. We wanted to attend both to show them what we are doing and to solicit ideas to advance our own research. You will see that work discussed in these blog pages.

Dedicating this Blog to Search

Search is everything in our world. When you log in, we are running a search against our security database. When you look at folders or document collections, we are searching to get their contents. When you click on the “More Like This” link, we are running an even more complex search. Clustering, predictive coding and email conversations are all defined by search. Some of it is about key words but, increasingly, more of it is about mathematics. Indeed, even if most don’t realize it, search engines are not really searching for words. Everything is hashed and turned into numbers beneath the surface. Ultimately, it is all ones and zeros.

This blog is dedicated to search. We hope to chronicle the developing law of search and make it practical and understandable for lawyers and other legal professionals. What are these cases saying and what do the judges want from us? I can’t promise perfect clarity here but we will sure try and make sense of these often differing decisions. Where logic can’t be found, we will say that too. At the very least, we will offer the “Catalyst Take” on what the courts are serving up.

We mean to go beyond the law here. Our goal is to provide tips, techniques and best practices for all kinds of searches—from privilege to production. After all, the stakes are high in the legal profession. Do this stuff wrong and you may be facing waiver of privilege, adverse inferences or even monetary sanctions. That’s bad news for you, your partners and your malpractice carrier. Your clients won’t be too happy with the results either.

Look for the Catalyst E-Discovery Search Blog to lead the way on topics such as:

  • The search cases: From O’Keefe to Mt. Hawley Insurance and beyond. Are lawyers qualified to craft and run their own search methodologies? If not, what should you be doing?
  • Defensibility of your search protocol. What must lawyers do to ensure that inadvertently produced privileged documents will be returned and not used in the case pursuant to FRE 502 and “clawback” agreements?
  • Tips and techniques to find privileged documents. Practical advice from search experts on how to find privileged documents and improve your search strategy.
  • Advanced analytical techniques for managing large document populations. New statistical and mathematical techniques to find relevant documents and pare down document populations to target relevant documents.

This isn’t just about defensibility. We’re passionate about effectiveness. How do we improve search effectiveness? Covering the Yin and Yang of search, we hope to arm you with better techniques to find relevant documents and discard those that don’t matter. When you are facing a large document population, weeding out spam can be as important as finding that smoking gun. Culling the batch to the few that actually need review is important both to keep down review costs and to meet tight deadlines.

Welcome to the Catalyst E-Discovery Search Blog. We hope you will find our content useful and come back often. We encourage comments, replies, challenges and tweets of all kinds. We are proud members of the legal search geeks community and offer this blog to our fellow travelers as our contribution to the topic. Search is an important part of the e-discovery process. We want to give it the serious, in-depth treatment it deserves. We also want to make it fun.

Catalyst Webinar: The Expanding Role of Search in E-Discovery

When searches fall short in e-discovery, the consequences can be serious. In recent years, courts have started closely scrutinizing lawyers’ searches–and even questioning their ability to craft thorough searches. When courts find flaws in a litigant’s search, they are increasingly likely to find waiver of attorney-client privilege, allow adverse inferences, order directed verdicts and impose sanctions that, in some cases, have run into the millions.

On May 23 at noon ET, Catalyst is hosting a free webinar, The Expanding Role of Search in E-Discovery. A panel of leading experts will review the key information you should understand about search in order to protect yourself and your clients. Topics will include:

  • The search cases: From O’Keefe to Mt. Hawley Insurance and beyond. Are lawyers qualified to craft and run their own search methodologies? If not, what should I be doing?
  • Defensibility of your search protocol. What must lawyers do to ensure that inadvertently produced privileged documents will be returned and not used in the case pursuant to FRE 502 and “clawback” agreements?
  • Tips and techniques to find privileged documents. Practical advice from search experts on how to find privileged documents and improve your search strategy.
  • Advanced analytical techniques for managing large document populations. New statistical and mathematical techniques to find relevant documents and pare down document populations to target relevant documents.

Speakers for the program will be:

  • Michael Arkfeld. Nationally renowned speaker and author of leading treatises on e-discovery, including Arkfeld on Electronic Discovery and Evidence (Lexis 3rd Ed.) and the annually updated Best Practices Guide for Electronic Discovery and Evidence. Michael is a former assistant U.S. attorney with 20 years of trial experience.
  • Charles W. Cohen. Partner at Hughes Hubbard & Reed LLP, where he is a co-chair of the eDiscovery Practice Group and a co-chair of the firm’s Technology Committee. Mr. Cohen has a national and international litigation and e-discovery practice, and has appeared in numerous federal and state courts across the country.
  • John Tredennick. Author of five books and countless articles on legal technology and electronic discovery issues who has spoken to legal audiences on five continents. After 20 years as a trial lawyer and litigation partner for an AmLaw 200 firm, John founded Catalyst Repository Systems, which provides secure, hosted document repositories for electronic discovery.

Read more about this webinar or register now.