Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

The Next Big Predictive Coding Case that Wasn’t

The case that many believed might be the next big bang in predictive coding jurisprudence instead has ended with barely a whimper.

As I noted here last month, in the wake of Magistrate Judge Andrew J. Peck’s ruling in Da Silva Moore v. Publicis Groupe affirming the use of predictive coding, many in the e-discovery field turned their attention to Kleen Products LLC v. Packaging Corporation of America, believing that it might be the Next Big Case on predictive coding.

The plaintiffs in Kleen Products had asked U.S. Magistrate Judge Nan Nolan to require the defendants to use predictive coding, and Judge Nolan had conducted two days of evidentiary hearings on the request as well as several status conferences.

Although the case continues on, predictive coding is off the table, at least for the time being. Last week, Judge Nolan approved a stipulation submitted by the parties in which plaintiffs withdrew their demand to apply predictive coding to any documents relating to any request for production filed prior to Oct. 1, 2013.

As to any requests for production filed after that date, the parties stipulated that they will meet and confer regarding the appropriate search methodology. “If the parties fail to agree on a search methodology,” the stipulation says, “either party may file a motion with the Court seeking resolution.”

That suggests that we may not have heard the last of Kleen Products in the context of computer-assisted search. But with any further possible rulings on the issue well over a year away, we can safely write it off as the next big case.

Use of Predictive Coding and the Cloud in E-Discovery Rose in 2012, Survey Says

Use of predictive coding and Internet-based electronic discovery tools rose in 2012, according to the recently published 2012 ABA Legal Technology Survey Report on litigation and courtroom technology.

Of lawyers whose firm had handled an e-discovery case, 44 percent said they had used Internet-based e-discovery tools, up from 31 percent in 2011. Thirty-five percent said they had used Internet-based litigation-support software, up from 24 percent in 2011. Of those same lawyers, 23 percent said they had used predictive coding to process or review e-discovery materials, up from 15 percent the prior year.

By comparison, lawyers’ use of desktop-based e-discovery tools rose only slightly, from 46 percent to 48 percent, and their use of desktop-based litigation support software held steady, at 46 percent.

Not surprisingly, the use of these types of e-discovery tools is far more common among lawyers in larger firms than among those in solo and small firms. Among lawyers whose firm has handled e-discovery matters, only 5 percent of solo lawyers and 6 percent of lawyers in firms of 2-9 lawyers say they’ve used predictive coding. By contrast, in firms of 500 or more lawyers, 43.5 percent report having used predictive coding.

A similar but less dramatic gap appeared when lawyers who have handled e-discovery matters were asked whether they had ever used Internet-based e-discovery tools. Among lawyers in firms of 500 or more, 67.3 percent say they’ve used these tools. Among solo lawyers, 33.3 percent say they have.

In fact, solo and small-firm lawyers are far less likely than their larger-firm counterparts to have ever handled an e-discovery matter. When asked how often they had made an e-discovery request on behalf of a client, 64.2 percent of solo lawyers said never. At firms of 500 or more, only 31.3 percent answered never.

Along the same lines, lawyers were asked how often they had received e-discovery requests on behalf of clients. Of solo lawyers, 56.1 percent said never. At firms of 500 or more, 27.9 percent said never.

Another question asked whether the lawyer’s firm (as opposed to the lawyer directly) had ever been involved in a case that required the processing or review of e-discovery materials. Only 12.8 percent of solos and 34.3 percent of lawyers in firms of 2-9 lawyers answered yes. Of lawyers in firms of 500 or more, 71 percent said yes. Among all respondents, across firms of all sizes, 43.8 percent said that their firms had been involved in an e-discovery matter.

On the topic of outsourcing, the survey asked lawyers whether they outsource e-discovery processing or review. The results show little change in outsourcing to e-discovery consultants and companies — 44 percent in 2012 compared to 45 percent in 2011. Likewise, the percentage of outsourcing to computer forensics specialists remained steady at 42 percent from 2011 to 2012.

However, the survey indicates that outsourcing to lawyers outside their own firm is on the rise. Outsourcing to lawyers within the United States rose from 16 percent in 2011 to 25 percent in 2012. Outsourcing to lawyers outside the United States rose from 3 percent in 2011 to 8 percent in 2012. Here again, the larger the firm, the more likely the lawyer is to outsource.

Something that surprised me in the survey is that there has been virtually no change over the past three years in the number of firms reporting that they have a distinct e-discovery initiative (such as a practice group). In 2012, 25 percent of respondents said their firms had such an initiative, down from 27 percent in 2011 and equal with 2010’s 25 percent. Also notable is that, among firms that have such an initiative, fewer of them report having a partner heading it up. Increasingly, the firm’s CIO is taking on primary responsibility for its e-discovery initiative.

The 2012 ABA Legal Technology Survey Report consists of six volumes, covering a range of topics from technology basics to mobile lawyering. The e-discovery results are contained in Volume III, which covers litigation and courtroom technology. Volume III is available for purchase from the ABA for $350 (or $300 for ABA members). An abbreviated trend report on litigation and courtroom technology can be purchased for $55 (or $45 for ABA members).

Achieving Efficiency in E-Discovery: Argyle’s Interview with John Tredennick

Recently, Scott Robbin of the Argyle Executive Forum interviewed Catalyst’s founder and CEO John Tredennick about how to make e-discovery more efficient and the advantages of using one vendor to provide a central repository to manage e-discovery documents. Below is their conversation.

Courts Should Consider Search Technology, Say New Penn. E-Discovery Rules

The Supreme Court of Pennsylvania has adopted new e-discovery rules that expressly distance themselves from federal e-discovery jurisprudence and instead emphasize “traditional principles of proportionality under Pennsylvania law.” Notably, the new rules provide that, when weighing proportionality, parties and courts should consider electronic search and sampling technology, among other factors.

The court promulgated the new e-discovery rules June 6 as amendments to the Pennsylvania Rules of Civil Procedure. They take effect Aug. 1, 2012.

The most significant change is to Rule 4009.1, governing requests for the production of documents and things. The current rule defines “documents” as including:

electronically created data, and other compilations of data from which information can be obtained, translated, if necessary, by the respondent party or person upon whom the request or subpoena is served through detection or recovery devices into reasonably usable form.

The amendment deletes this entire phrase and replaces it with the simpler phrase, “electronically stored information.” The amended rule will now read:

Any party may serve a request upon a party pursuant to Rules 4009.11 and 4009.12 or a subpoena upon a person not a party pursuant to Rules 4009.21 through 4009.27 to produce and permit the requesting party, or someone acting on the party’s behalf, to inspect and copy any designated documents (including writings, drawings, graphs, charts, photographs, and electronically stored information), or to inspect, copy, test or sample any tangible things or electronically stored information, which constitute or contain matters within the scope of Rules 4003.1 through 4003.6 inclusive and which are in the possession, custody or control of the party or person upon whom the request or subpoena is served, and may do so one or more times.

But while the rule adopts the phrase used in the federal rules, the official comment makes clear that the court’s intent is not to adopt federal e-discovery law:

Though the term “electronically stored information” is used in these rules, there is no intent to incorporate the federal jurisprudence surrounding the discovery of electronically stored information. The treatment of such issues is to be determined by traditional principles of proportionality under Pennsylvania law as discussed in further detail below.

One other significant change to the rule is the addition of a new subparagraph (b) to Rule 4009.1, which addresses the form of production. The new rule says that the party requesting ESI may specify the format in which it is to be produced, to which the responding party may object. If the requesting party does not specify a format, then the ESI may be produced “in the form in which it is ordinarily maintained or in a reasonably usable form.”

Proportionality Should Prevail

The official comment to the amended rules emphasizes the importance of proportionality in determining the scope of discovery obligations. The overarching goal of the rules, the comment says, is to ensure that discovery is conducted in a manner that is “consistent with the just, speedy and inexpensive determination and resolution of litigation disputes.” To that end, the comment continues, courts faced with discovery disputes should consider five factors:

  1. The nature and scope of the litigation, including the importance and complexity of the issues and the amounts at stake.
  2. The relevance of ESI and its importance to the court’s adjudication in the given case.
  3. The cost, burden, and delay that may be imposed on the parties to deal with ESI.
  4. The ease of producing ESI and whether substantially similar information is available with less burden.
  5. Any other factors relevant under the circumstances.

The comment goes on to identify what it describes as “tools for addressing” ESI. It says:

Parties and courts may consider tools such as electronic searching, sampling, cost sharing, and non-waiver agreements to fairly allocate discovery burdens and costs. When utilizing non-waiver agreements, parties may wish to incorporate those agreements into court orders to maximize protection vis-à-vis third parties.

This language leaves much to interpretation. Even so, it clearly encourages courts and parties to take technology into consideration when weighing discovery burdens and costs. Implicit in this, it seems fair to say, is the court’s recognition that search, sampling and tools such as predictive coding can significantly reduce both the burden and cost of e-discovery.

With these new rules, Pennsylvania’s Supreme Court has made clear its intent to chart its own route on e-discovery, independent of federal jurisprudence. It will be interesting to see how this course develops. Even so, in their own way, these new rules add to the growing body of law that recognizes the increasingly essential link between sophisticated technology and cost-effective e-discovery.

Two Catalyst Researchers Coauthor Book on Next-Generation Search

Two leaders of research and development here at Catalyst have helped write a seminal new book on search and information retrieval, Next Generation Search Engines: Advanced Models for Information Retrieval, released in March by IGI Global.

Bruce Kiefer, leader of the platform group at Catalyst, and Reed Esau, a platform architect focused on research and development at Catalyst, together with Michael W. Berry, associate director of the Center for Intelligent Systems & Machine Learning at the University of Tennessee, co-authored the book’s chapter, “The Use of Text Mining Techniques in Electronic Discovery for Legal Matters.”

As volumes of e-discovery data have outgrown the manual processes long used to make relevance judgments, Kiefer and Esau explain how methods of text mining and information retrieval, including predictive coding, can be used to help reduce data volumes. Acknowledging that text-mining techniques have so far delivered uneven results, they start the chapter by looking at the historical bias of the collection process. They then examine how tools like classifiers, latent semantic analysis, and non-negative matrix factorization can deal with nuances of the collection process.

Their chapter is part of a book intended for scientists and decision-makers who wish to gain working knowledge about search in order to evaluate available options and to engage in a dialogue with software and data providers. The aim of the book is to give readers a better understanding of the latest trends in applied research.

The book is available for purchase in hardcover and e-book editions from IGI Global (www.igi-global.com). The individual chapter can be purchased separately as a PDF download.

For more information about the book and the authors, see the full announcement.

Catalyst’s Jim Eidelman Discusses Predictive Coding in ‘Law Technology News’

Now that U.S. District Judge Andrew L. Carter Jr. has affirmed the groundbreaking predictive coding order issued by U.S. Magistrate Judge Andrew J. Peck in Da Silva Moore v. Publicis Groupe, Law Technology News reporter Evan Koblentz went back and spoke to leading professionals in the legal technology field for their reactions. You can read his story here: Take Two: Reactions to ‘Da Silva Moore’ Predictive Coding Order.

One of the people Koblentz quotes is Catalyst’s own Jim Eidelman, senior search and analytics consultant on the Catalyst Search & Analytics Consulting team. These court decisions gave predictive coding “a legitimacy that was needed,” Eidelman told Koblentz. But before predictive coding can fully enter the mainstream, engineers need to work out some of the technology’s limitations, he said.

“Obviously it is all about the process, the sampling, and the use of common sense,” Eidelman said. “Some documents can only be found other ways, and predictive coding isn’t a universal solution. Clearly multi-mode searching and review is required in every case, with or without da Silva.”

Eidelman goes on to discuss what he says is “one of the big defensibility issues nobody is talking about.” That issue is pre-culling using keyword searching — something that can leave relevant documents behind and taint the process.

“Other big issues are sampling methodologies, how multiple issues are handled, and attorney-client privilege,” Eidelman says in the article. “There is still so much to be worked out. We are just at the infancy of the machine learning applied to e-discovery documents, even though ‘relevance feedback’ has been used in other areas for decades.”

Read the full article with reactions from a number of technology professionals at Law Technology News.

Federal Court Affirms Judge Peck’s Predictive Coding Order

When last we left the case of Da Silva Moore v. Publicis Groupe–the groundbreaking case in which U.S. Magistrate Judge Andrew J. Peck issued the first judicial opinion to endorse the use of computer-assisted review and predictive coding–it was headed for review by U.S. District Judge Andrew L. Carter Jr. Now, thanks to a heads-up from Evan Koblentz at Law Technology News, we learn that Judge Carter has issued his ruling and has adopted Judge Peck’s opinion.

“The Court adopts Judge Peck’s rulings because they are well reasoned and they consider the potential advantages and pitfalls of the predictive coding software,” Judge Carter wrote in an opinion filed today.

In challenging Judge Peck’s order, the plaintiffs had argued that he had mischaracterized and confused the issue of whether they had consented to the use of predictive coding. Judge Carter concluded that any such confusion was immaterial.

The confusion is immaterial because the ESI protocol contains standards for measuring the reliability of the process and the protocol builds in levels of participation by Plaintiffs. It provides that the search methods will be carefully crafted and tested for quality assurance, with Plaintiffs participating in their implementation. For example, Plaintiffs’ counsel may provide keywords and review the documents and the issue coding before the production is made. If there is a concern with the relevance of the culled documents, the parties may raise the issue before Judge Peck before the final production. Further, upon the receipt of the production, if Plaintiffs determine that they are missing relevant documents, they may revisit the issue of whether the software is the best method.

Plaintiffs also challenged Judge Peck’s order on the ground that predictive coding is not a reliable method. Judge Carter ruled that this issue is also premature. As the litigation continues, if the parties believe the predictive coding software is flawed or that the process produces incomplete results, they can raise their concerns with Judge Peck and ask him to reconsider, Judge Carter noted. “To call the method unreliable at this stage is speculative.”

There simply is no review tool that guarantees perfection. The parties and Judge Peck have acknowledged that there are risks inherent in any method of reviewing electronic documents. Manual review with keyword searches is costly, though appropriate in certain situations. However, even if all parties here were willing to entertain the notion of manually reviewing the documents, such review is prone to human error and marred with inconsistencies from the various attorneys’ determination of whether a document is responsive. Judge Peck concluded that under the circumstances of this particular case, the use of the predictive coding software as specified in the ESI protocol is more appropriate than keyword searching. The Court does not find a basis to hold that his conclusion is clearly erroneous or contrary to law.

As to that secondary issue I mentioned in an earlier blog post–whether Rule 702 and Daubert apply to a court’s acceptance of a predictive-coding protocol–Judge Carter made short work of that. In a footnote, he wrote: “The Court adopts Judge Peck’s analysis of Rule 26(g) and Fed. R. Evidence 702 for similar reasons provided in his written opinion.”

Thus, Judge Peck’s predictive coding order has stood its ground and, with Judge Carter’s adoption of his reasoning, the use of predictive coding has taken another giant step towards the mainstream.

Catalyst Named Leading Provider of Predictive Coding Technology

Complex Discovery, a blog focused on the intersection of electronic discovery and social media, has named Catalyst a leading provider of predictive coding technology.

The blog’s author, Rob Robinson, compiled the list of 22 predictive coding providers based on his review of information from leading providers of e-discovery technology.

See Rob’s full list here: 20+ Predictive Coding Technology Providers.

Should the ‘Daubert’ Standard Apply to Predictive Coding? We May Know Soon

It’s been a month since U.S. Magistrate Judge Andrew J. Peck issued his seminal opinion on predictive coding, Da Silva Moore v. Publicis Groupe, and it continues to make waves. Notably, it appears that U.S. District Judge Andrew L. Carter Jr. will weigh in on the issue. On March 13, he entered an order granting plaintiffs’ request to submit additional briefing on their objections to Judge Peck’s order.

A key issue Judge Carter may need to address is one given short shrift in coverage of and commentary on Judge Peck’s opinion. Understandably, most of the commentary focused on the fact that Judge Peck’s opinion marked a milestone — the first judicial opinion to recognize that computer-assisted review is an acceptable way to search for electronically stored information.

But in the course of that opinion, Judge Peck made another significant ruling. He concluded that Federal Rule of Evidence 702 and the Supreme Court’s decision in Daubert v. Merrell Dow Pharmaceuticals do not apply to a court’s acceptance of a predictive-coding protocol.

Rule 702 and Daubert give trial judges the responsibility to act as “gatekeepers” to exclude unreliable scientific and technical expert testimony. Judge Peck reasoned that these did not apply to the Da Silva Moore case because no one was trying to put anything into evidence. Here is how he explained it:

If MSL sought to have its expert testify at trial and introduce the results of its ESI protocol into evidence, Daubert and Rule 702 would apply. Here, in contrast, the tens of thousands of emails that will be produced in discovery are not being offered into evidence at trial as the result of a scientific process or otherwise. The admissibility of specific emails at trial will depend upon each email itself (for example, whether it is hearsay, or a business record or party admission), not how it was found during discovery.

Rule 702 and Daubert simply are not applicable to how documents are searched for and found in discovery.

You may recall that before Judge Peck issued his written opinion in this case on Feb. 22, he made oral rulings at the motion hearing on Feb. 8. On Feb. 22, just as Judge Peck was issuing his written opinion, the plaintiffs filed objections to his Feb. 8 rulings. One of their central arguments was that Judge Peck erred in disregarding his gatekeeper role under Daubert.

Because predictive coding is a new and novel technology, they argued, Judge Peck should have required expert testimony regarding its reliability or appropriateness. They cite Magistrate Judge Paul Grimm’s well-known ruling in Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 260 n.10 (D. Md. 2008), where he said, “[R]esolving contested issues of whether a particular search and information retrieval method was appropriate … involves scientific, technical or specialized information.” Relying on this, the plaintiffs argued:

[A]t no point did the Magistrate review any evidence to support his decision. The Magistrate took no judicial notice of any documents or studies that support the reliability of MSL’s method, nor did he receive any affidavits or declarations from purported experts that supported the methodology of MSL’s method. To his credit, the Magistrate did ask the parties to bring the ESI experts they had hired to advise them regarding the creation of an ESI protocol. These experts, however, were never sworn in, and thus the statements they made in court at the hearings were not sworn testimony made under penalty of perjury. The Magistrate judge never asked for or evaluated the qualifications of these experts, nor were the parties given an opportunity to question or cross-examine the experts in order for the Court to make a finding regarding the reliability of the experts’ opinions. Thus, the Magistrate’s decision relies only on the arguments made by counsel.

On March 7, the defendants responded to plaintiffs’ objections. With regard to the Daubert issue, they took the same position as Judge Peck–that Rule 702 and Daubert do not apply to the methods used to take discovery, but only kick in when evidence is presented at trial.

Plaintiffs simply are incorrect in their assertion that Victor Stanley requires expert testimony regarding the methodology selected by a party to search for electronically stored information. Rather, this case only requires that the selected methodology was carefully planned by qualified persons, contains provisions for quality assurance, and is supported by persons with the requisite qualifications and experience.

After receiving the defendants’ response, the plaintiffs wrote to Judge Carter on March 9 asking for leave to file their own response.

[W]hile Plaintiffs were denied an opportunity to respond to Magistrate Judge Peck’s written opinion, MSL had the benefit of filing its opposition approximately two weeks after the written opinion had been issued. Indeed, MSL’s opposition brief and supporting expert declarations not only reference, but also largely rely upon Magistrate Judge Peck’s observations regarding Plaintiffs’ Objection.

Judge Carter granted the plaintiffs’ request on March 13. (The brief was due March 19.) That means that we can expect him to issue a ruling of his own. It seems unavoidable that any ruling he issues will address the core issue of the appropriateness of computer-assisted review, at least in this case. Most likely, he will also have to address this secondary issue of the applicability of Daubert. If he does, in fact, squarely address these issues–and regardless of whether he agrees with Judge Peck–his ruling will be yet another milestone for predictive coding.

Judge Peck Provides a Primer on Computer-Assisted Review

Magistrate Judge Andrew J. Peck issued a landmark decision in Monique Da Silva Moore v. MSL Group, filed on Feb. 24, 2012. This much-blogged-about decision made headlines as being the first judicial opinion to approve the process of “predictive coding,” which is one of the many terms people use to describe computer-assisted coding.

Well, Judge Peck did just that. As he hinted during his presentations at LegalTech, this was the first time a court had the opportunity to consider the propriety of computer-assisted coding. Without hesitation, Judge Peck ushered us into the next generation of e-discovery review—people assisted by a friendly robot. That set the e-discovery blogosphere buzzing, as Bob Ambrogi pointed out in an earlier post.

I recommend reading the decision (and its accompanying predictive-coding protocol) not for its result but for its reasoning. This is one of the best sources I have seen on the reasons for and processes underlying predictive coding. Indeed, Judge Peck provided a primer on how to conduct predictive coding that is must reading for anyone wanting to get up to speed on this process.

What is Computer-Assisted Review?

Judge Peck started by quoting from his earlier article in Law Technology News:

By computer-assisted coding, I mean tools (different vendors use different names) that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with (i.e. training by) a human reviewer.

As Judge Peck concluded: “This judicial opinion now recognizes that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.”

Why Do We Need Computer-Assisted Review?

The answer for Judge Peck was simple: Other methods of finding relevant documents are expensive and less effective. As he explained:

  • The objective of e-discovery is to identify as many relevant documents as possible while reviewing as few non-relevant documents as possible.
  • Linear review is often too expensive. Although linear review is seen as the “gold standard,” studies show that the computerized searches underlying predictive coding are at least as accurate as human review, if not more accurate.
  • Studies also show a high rate of disagreement among human reviewers as to whether a document is relevant. In most cases, the difference is attributable to human error or fatigue.
  • Keyword searches to reduce data sets also miss a large percentage of relevant documents. The typical practice of opposing parties choosing keywords resembles a game of “Go Fish,” as Ralph Losey once pointed out.
  • Keyword searches are often over-inclusive, finding large numbers of irrelevant documents that increase review costs. They can also be under-inclusive, missing relevant documents. In one key study the recall rate was just 20%.
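For readers who want to pin down what “over-inclusive” and “under-inclusive” mean in numbers, here is a minimal sketch using the standard information-retrieval definitions of precision and recall. The counts are hypothetical, picked only so that recall comes out at 20%; they are not taken from the opinion or from the study Judge Peck cited.

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Return (precision, recall) for a single search.

    precision: share of the returned documents that are relevant
               (low precision means the search is over-inclusive)
    recall:    share of all relevant documents that the search returned
               (low recall means the search is under-inclusive)
    """
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    return precision, recall


# Hypothetical counts for one keyword search (assumptions for illustration).
precision, recall = precision_recall(
    retrieved_relevant=2_000,  # relevant documents the search returned
    retrieved_total=8_000,     # all documents the search returned
    relevant_total=10_000,     # relevant documents in the whole collection
)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

In this made-up example, three out of four documents sent to review are irrelevant, and four out of five relevant documents are never seen at all.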

Ultimately, Judge Peck reminded us of the goals underlying the Federal Rules of Civil Procedure. Perfection is not required. The goal is the “just, speedy, and inexpensive determination” of lawsuits.

Judge Peck concluded that the use of predictive coding was appropriate in this case for the following reasons:

  1. The parties’ agreement.
  2. The vast amount of ESI (over 3 million documents).
  3. The superiority of computer-assisted review over manual review or keyword searches.
  4. The need for cost effectiveness and proportionality.
  5. The transparent process proposed by the parties.

The last point was perhaps the most important factor leading to the decision: “MSL’s transparency in its proposed ESI search protocol made it easier for the Court to approve the use of predictive coding.”

How Does the Process Work?

The court attached the parties’ proposed protocol to the opinion. While it does not represent the only way to do computer-assisted review, it provides a helpful look into how the process works.

  1. The process in this case began with attorneys developing an understanding of the files and identifying a small number that will function as an initial seed set representative of the categories to be reviewed and coded. There are a number of ways to develop the seed set, including the use of search tools and other filters, interviews, key custodian review, etc. You can see more on this subject below.
  2. Opposing counsel should be advised of the hit counts and keyword searches used to develop the seed set and invited to submit their own keywords. They should also be provided with the resulting seed documents and allowed to review and comment on the coding done on the seed documents.
  3. The seed sets are then used to begin the predictive coding process. Each seed set (one per issue being reviewed) is used to begin training the software.
  4. The software uses each seed set to identify and prioritize all similar documents over the complete corpus under review. The attorneys then review at least 500 of the computer-selected documents to confirm that the computer is properly categorizing the documents. This is a calibration process.
  5. Transparency requires that opposing counsel be given a chance to review all non-privileged documents used in the calibration process. If the parties disagree on tagging, they meet and confer to resolve the dispute.
  6. At the conclusion of the training process, the system then identifies relevant documents from the larger set. These documents are reviewed manually for production. In this case, the producing party reserved the right to seek relief should too many documents be identified.
  7. Accuracy during the process should be tested and quality controlled by both judgmental and statistical sampling.
  8. Statistical sampling involves a small set of documents randomly selected from the total files to be tested. That allows the parties to project error rates from the sample.
  9. Here, the parties agreed on a series of issues that will, of necessity, vary on other cases. The key point is that the parties agree on the issues and test the coding during the process.

Random Samples

It is important to create an initial random sample from the entire document set. The parties used a 95% confidence level with a margin of error of 2%. They determined that the sample size should be 2,399 documents. You can figure this out using one of the publicly available sample-size calculators such as Raosoft, which we often use.
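If you would rather check the arithmetic yourself, the sketch below applies the textbook sample-size formula for estimating a proportion, with a finite-population correction. It is an illustration under stated assumptions, not the parties’ own calculation: the population of 3,000,000 is a stand-in for the “over 3 million documents” in the case, the z-value of 1.96 is the conventional figure for 95% confidence, and p = 0.5 is the usual worst-case assumption.

```python
def sample_size(population=None, margin=0.02, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion.

    population: total number of documents, or None for "effectively unlimited"
    margin:     margin of error as a fraction (0.02 means plus or minus 2%)
    z:          z-value for the confidence level (1.96 for roughly 95%)
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Sample size for an unlimited population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is None:
        return n0
    # Finite-population correction shrinks the sample slightly.
    return n0 / (1 + (n0 - 1) / population)


unlimited = sample_size()
finite = sample_size(population=3_000_000)  # assumed collection size
print(round(unlimited), round(finite))
```

Under these assumptions the uncorrected figure is about 2,401, and the correction for a collection of roughly three million documents brings it to about 2,399, which is consistent with the number the parties used.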

Seed Sets

The protocol goes on to describe a number of ways to generate seed sets including:

  • Agreed-upon search terms.
  • Judgmental analysis.
  • Concept search.

The parties frequently sampled the results from searches to evaluate their effectiveness.

There is at least one good blog post to be written about seed sets. Some computer-assisted coding systems like the one used for this case start their process with seed sets. The notion is that attorneys understand the cases, know what is and is not relevant and can train the system to recognize more relevant documents more effectively than starting with no seed documents.

Others think this is a mistake. They believe that, however well-meaning, the attorneys will bias the system to find what they think is relevant and get self-reinforcing results. In this regard, they are suggesting that the attorneys will make the same mistake found in keyword searches: thinking that you know which words will be most effective at finding your documents.

Systems following this logic urge the user to start from scratch, telling the system what is and is not relevant based on reviewing documents. As you do that, the system begins developing its own profile of relevant documents and builds out the searches. The belief is that the system may create a better search through this process than it might if you bias it with your seed documents.

There is a middle ground here as well. Many of the latter systems (no seed) will allow you to submit a limited number of seed documents as part of the training process. That may represent the best of both worlds or it may not, depending on your beliefs. The important point is that there are different approaches to computer-assisted processing. This protocol shows you one approach only.

Training Iterations

The process involves a number of computer runs to find responsive documents. The parties started with a first set of potentially relevant documents based on analysis of the seed set. After that review, the computer was asked to consider the new tagging and find a second set for testing. Then a third and a fourth.

The protocol suggested that the parties run through this process seven times. The key is to watch the change in the number of relevant documents predicted by the system after each round of testing. Once the change in that number from one round to the next dropped below 5%, the parties had the option to stop. The notion is that the system has become stable by that time, with further review unlikely to uncover many more relevant documents.
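To make the stopping rule concrete, here is a small sketch of how one might track it. Two things here are assumptions on my part rather than terms of the protocol: the “delta” is read as the relative change in the predicted relevant count compared with the previous round, and the per-round counts are invented for illustration.

```python
def first_stable_round(relevant_counts, threshold=0.05):
    """Return the 1-based round at which the predicted relevant count
    first changes by less than `threshold` relative to the prior round,
    or None if that never happens.

    relevant_counts: predicted relevant documents after each training
                     round, in order; assumed to be positive numbers
    """
    for i in range(1, len(relevant_counts)):
        previous, current = relevant_counts[i - 1], relevant_counts[i]
        change = abs(current - previous) / previous
        if change < threshold:
            return i + 1
    return None


# Hypothetical predicted-relevant counts over seven rounds.
counts = [40_000, 52_000, 60_000, 64_000, 66_000, 67_000, 67_500]
stable_at = first_stable_round(counts)
print(stable_at)
```

With these made-up numbers the change falls from 30% to about 15%, then about 7%, then about 3%, so the parties would have had the option to stop after the fifth round.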

Finishing the Process

Once the training has completed and the system is “stable,” we move from computer-assisted to human-powered review. At that point, the producing party reviews all of the potentially responsive documents and produces accordingly.

Final QC Protocol

As a final stage, the parties need to focus on the potentially non-responsive documents—the ones the system says to ignore. The parties select a random sample (2,399 documents again) to see how many were, in fact, responsive.

These same documents (non-privileged ones) must be produced to the opposing party for review. If that party finds too many responsive documents in the sample or otherwise objects, it is time for a meet-and-confer to resolve the dispute. Failing that, you can always go to the court and fight it out.
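What does the final sample actually tell you? The sketch below projects the sample result onto the whole set of documents the system marked as non-responsive, using a simple normal-approximation confidence interval. This is one common way to do the projection, not a method prescribed by the protocol, and the size of the discard pile and the number of responsive documents found in the sample are hypothetical.

```python
import math


def project_missed(sample_size, responsive_in_sample, discard_pile_size, z=1.96):
    """Estimate how many responsive documents remain in the discard pile.

    Returns (rate, low, high): the sample rate of responsive documents,
    plus the low and high ends of the projected count at roughly 95%
    confidence (normal approximation, which is rough for very small rates).
    """
    rate = responsive_in_sample / sample_size
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    low = max(0.0, rate - margin) * discard_pile_size
    high = min(1.0, rate + margin) * discard_pile_size
    return rate, low, high


# Hypothetical result: 48 responsive documents in a 2,399-document sample
# drawn from 2,500,000 documents the system said to ignore.
rate, low, high = project_missed(2_399, 48, 2_500_000)
print(f"rate={rate:.2%}, projected missed documents: {low:,.0f} to {high:,.0f}")
```

Whether a projected range like that counts as “too many” is exactly the question the parties would take up at the meet-and-confer.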

Is This the Bible on Predictive Coding?

Certainly not. There are a lot of ways to approach this process. However, first opinions on any topic carry a lot of weight. We chose a profession that is guided by precedent, and these are first tracks on this new and exciting subject. The suggested procedures make sense to me and provide a starting point for your predictive coding efforts. This opinion and its accompanying protocol are important reading whether you are proposing or opposing the process for your next case.