Last month, two of the leading experts on e-discovery, Maura R. Grossman and Gordon V. Cormack, presented a peer-reviewed study on continuous active learning, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” at the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).
In the study, they compared three TAR protocols, testing them across eight different cases. Two of the three protocols, Simple Passive Learning (SPL) and Simple Active Learning (SAL), are typically associated with early approaches to predictive coding, which we call TAR 1.0. The third, continuous active learning (CAL), is a central part of a newer approach to predictive coding, which we call TAR 2.0. Continue reading
Maura Grossman and Gordon Cormack just released another blockbuster article, “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’” 7 Federal Courts Law Review 286 (2014). The article was in part a response to an earlier article in the same journal by Karl Schieneman and Thomas Gricks, in which they asserted that Rule 26(g) imposes “unique obligations” on parties using TAR for document productions and suggested using techniques we associate with TAR 1.0 including: Continue reading
This past weekend I received an advance copy of a new research paper prepared by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” They have posted an author’s copy here.
The study attempted to answer one of the more important questions surrounding TAR methodology: Continue reading
I read with great interest a recent article in Law Technology News, “Four Examples of Predictive Coding Success,” by Barclay T. Blair.
The purpose of the article was to report on several successful uses of technology-assisted review. While that was interesting, my attention was drawn to another aspect of the report. Three of the case studies provided data shedding further light on that persistent e-discovery mystery: “How many documents in a gigabyte?” Continue reading
[This article originally appeared in the Winter 2014 issue of EDDE Journal, a publication of the E-Discovery and Digital Evidence Committee of the ABA Section of Science and Technology Law.]
Although still relatively new, technology-assisted review (TAR) has become a game changer for electronic discovery. This is no surprise. With digital content exploding at unimagined rates, the cost of review has skyrocketed, now accounting for over 70% of discovery costs. In this environment, a process that promises to cut review costs is sure to draw interest, as TAR, indeed, has.
Called by various names—including predictive coding, predictive ranking, and computer-assisted review—TAR has become a central consideration for clients facing large-scale document review. It originally gained favor for use in pre-production reviews, providing a statistical basis to cut review time by half or more. It gained further momentum in 2012, when federal and state courts first recognized the legal validity of the process. Continue reading
Predictive Ranking, aka predictive coding or technology-assisted review, has revolutionized electronic discovery, at least in mindshare if not actual use. It now dominates the dais for discovery programs, and has since 2012 when the first judicial decisions approving the process came out. Its promise of dramatically reduced review costs is top of mind today for general counsel. For review companies, the worry is about declining business once these concepts really take hold.
While there are several “Predictive Coding for Dummies” books on the market, I still see a lot of confusion among my colleagues about how this process works. To be sure, the mathematics are complicated, but the techniques and workflow are not that difficult to understand. I write this article with the hope of clarifying some of the more basic questions about TAR methodologies. Continue reading
On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process and generated a lot of interest and a bit of dyspepsia too. At the least, it got some good discussions going and perhaps nudged the status quo a bit in the balance.
One writer, Roe Frazer, took issue with our views in a blog post he wrote. Apparently, he tried to post his comments with Law Technology News but was unsuccessful. Instead, he posted his reaction on the blog of his company, Cicayda. We would have responded there but we don’t see a spot for replies on that blog either. Continue reading
For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”
At the time, most people put the number at 10,000 documents per gigabyte, with a range of 5,000 to 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar. Continue reading
The big dog today is electronic discovery.
There has been debate lately about the proper spelling of the shorthand version of electronic discovery. Is it E-Discovery or e-discovery or Ediscovery or eDiscovery? Our friends at DSIcovery recently posted on that topic and it got me thinking.
The industry seems to be of differing minds. Several of the leading legal and business publications use e-discovery, as do we. They include Law Technology News, the other ALM publications, the Wall Street Journal (see here, for example), the ABA Journal (example), Information Week (example) and Law360 (example).
Also using e-discovery are industry analysts such as Gartner and 451 Research.
A number of vendors favor the non-hyphenated versions Continue reading
One of the givens of traditional CAR (computer-assisted review) in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.
Later in the process, we use simple random samples to determine whether our CAR succeeded. Continue reading
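The richness estimate mentioned in the excerpt above can be sketched in a few lines of Python. This is only an illustration, not the method used in any particular CAR tool: the function names `draw_simple_random_sample` and `estimate_richness` are hypothetical, the sample counts are invented for the example, and the interval uses the simple normal approximation, which is an assumption that works reasonably well only when the sample is fairly large and richness is not close to zero.

```python
import math
import random


def draw_simple_random_sample(doc_ids, sample_size, seed=None):
    """Pick sample_size documents uniformly at random, without replacement.

    doc_ids must be a sequence (for example, a list of document IDs).
    """
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)


def estimate_richness(sample_labels, z=1.96):
    """Estimate collection richness from reviewer judgments on a random sample.

    sample_labels: one boolean per sampled document (True = relevant).
    z: z-score for the desired confidence level (1.96 is roughly 95%).
    Returns (point_estimate, lower_bound, upper_bound) as proportions.
    """
    n = len(sample_labels)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(1 for label in sample_labels if label) / n
    # Normal-approximation margin of error for a proportion (an assumption).
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical example: 400 sampled documents, 40 judged relevant.
labels = [True] * 40 + [False] * 360
richness, low, high = estimate_richness(labels)
print(f"Estimated richness: {richness:.1%} (about {low:.1%} to {high:.1%})")
```

Multiplying the estimated proportion by the size of the collection gives a rough count of the relevant documents one might expect to see, with the same caveats about the interval.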