In July 2014, attorney Maura Grossman and professor Gordon Cormack introduced a new protocol for Technology Assisted Review that they showed could cut review time and costs substantially. Called Continuous Active Learning (“CAL”), this new approach differed from traditional TAR methods because it employed continuous learning throughout the review, rather than the one-time training used by most TAR technologies.
Their peer-reviewed research paper, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” also showed that using random documents was the least effective method for training a TAR system. Overall, they showed that CAL solved a number of real-world problems that had bedeviled review managers using TAR 1.0 protocols.
Not surprisingly, their research caused a stir. Some heralded its common-sense findings about continuous learning and the inefficiency of using random seeds for training. Others challenged the results, arguing that one-time training is good enough and that using random seeds eliminates bias. We were pleased that it confirmed our earlier research and legitimized our approach, which we call TAR 2.0.
In Part One of this two-part post, I introduced readers to the statistical problems inherent in proving the level of recall reached in a Technology Assisted Review (TAR) project. Specifically, I showed that with typical sample sizes, the confidence interval around an asserted recall percentage can be wide enough to undercut the basic assertion used to justify your TAR cutoff.
In our hypothetical example, we had to acknowledge that while our point estimate suggested we had found 75% of the relevant documents in the collection, it was possible that we found only a far lower percentage. For example, with a sample size of 600 documents, the lower bound of our confidence interval was 40%. If we increased the sample size to 2,400 documents, the lower bound only increased to 54%. And, if we upped our sample to 9,500 documents, we got the lower bound to 63%.
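To see why a larger sample tightens the lower bound only slowly, here is a minimal Python sketch. It rests on assumptions that are not in Part One: it treats the sample as a set of documents already known to be relevant, counts how many of them the review found, and uses a Wilson score interval at roughly 95% confidence. The function name `wilson_interval` and the sample counts are hypothetical, so the output is not meant to reproduce the 40%, 54% and 63% figures above, which come from the details of that hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(found: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if sample_size <= 0 or not 0 <= found <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= found <= sample_size")
    p = found / sample_size  # point estimate of recall in the sample
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical samples of relevant documents, each with 75% already found.
for relevant_in_sample in (20, 100, 400):
    found = round(0.75 * relevant_in_sample)
    low, high = wilson_interval(found, relevant_in_sample)
    print(f"{relevant_in_sample:>4} relevant docs sampled: "
          f"recall between {low:.0%} and {high:.0%}")
```

The point estimate is 75% in every row, but the interval narrows only as the number of relevant documents in the sample grows. When relevant documents are rare in the collection, reaching those counts means drawing a much larger sample overall.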
Even assuming that 63% as a lower bound is enough, we would have a lot of documents to sample. Using basic assumptions about cost and productivity, we concluded that we might spend 95 hours to review our sample at a cost of about $20,000. If the sample didn’t prove out our hoped-for recall level (or if we received more documents to review), we might have to run the sample several times. That is a problem.
Is there a better and cheaper way to prove recall in a statistically sound manner? In this Part Two, I will take a look at some of the other approaches people have put forward and see how they match up. However, as Maura Grossman and Gordon Cormack warned in “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review’” and Bill Dimm amplified in a later post on the subject, there is no free lunch.
As most e-discovery professionals know, two leading experts in technology assisted review, Maura R. Grossman and Gordon V. Cormack, recently presented the first peer-reviewed scientific study on the effectiveness of several TAR protocols, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).
Perhaps the most important conclusion of the study was that an advanced TAR 2.0 protocol, continuous active learning (CAL), proved to be far more effective than the two standard TAR 1.0 protocols used by most of the early products on the market today: simple passive learning (SPL) and simple active learning (SAL).
“Which is it to-day,” [Watson] asked, “morphine or cocaine?”
[Sherlock] raised his eyes languidly from the old black-letter volume which he had opened.
“It is cocaine,” he said, “a seven-per-cent solution. Would you care to try it?”
-The Sign of the Four, Sir Arthur Conan Doyle (1890)
Back in the mid-to-late 1800s, many touted cocaine as a wonder drug, providing not only stimulation but a wonderful feeling of clarity as well. Doctors prescribed the drug in a seven percent solution of water. Although Watson did not approve, Sherlock Holmes felt the drug helped him focus and shut out the distractions of the real world. He came to regret his addiction in later novels, as cocaine moved out of the mainstream.
This story is about a different type of seven percent solution, with no cocaine involved. Rather, we will be talking about the impact of another kind of stimulant, one that saves a surprising amount of review time and costs. This is the story of how a seemingly small improvement in review richness can make a big difference for your e-discovery budget.
A critical metric in Technology Assisted Review (TAR) is recall, which is the percentage of relevant documents actually found from the collection. One of the most compelling reasons for using TAR is the promise that a review team can achieve a desired level of recall (say 75% of the relevant documents) after reviewing only a small portion of the total document population (say 5%). The savings come from not having to review the remaining 95% of the documents. The argument is that the remaining documents (the “discard pile”) include so few that are relevant (against so many irrelevant documents) that further review is not economically justified.
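The arithmetic behind that argument can be sketched in a few lines of Python. Every number here is a hypothetical chosen to match the 75% and 5% figures in the paragraph above. The collection size, the count of relevant documents and the function names are assumptions for illustration only.

```python
def recall(relevant_found: int, total_relevant: int) -> float:
    """Share of all relevant documents that the review actually found."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return relevant_found / total_relevant


def discard_pile_richness(total_relevant: int, relevant_found: int,
                          collection_size: int, reviewed: int) -> float:
    """Share of the unreviewed documents (the discard pile) that are relevant."""
    remaining_docs = collection_size - reviewed
    if remaining_docs <= 0:
        raise ValueError("nothing left in the discard pile")
    return (total_relevant - relevant_found) / remaining_docs


# Hypothetical collection: 100,000 documents, 1,000 of them relevant.
collection_size = 100_000
total_relevant = 1_000
reviewed = 5_000        # 5% of the collection
relevant_found = 750    # relevant documents located during that review

print(f"Reviewed: {reviewed / collection_size:.0%} of the collection")
print(f"Recall: {recall(relevant_found, total_relevant):.0%}")
print("Discard pile richness: "
      f"{discard_pile_richness(total_relevant, relevant_found, collection_size, reviewed):.2%}")
```

Under these assumed numbers, the 95,000 unreviewed documents hold only 250 relevant ones, which is the "so few against so many" ratio that the argument relies on.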
I am sad to report that Browning Marean passed away last Friday. He will be sorely missed by his partners at DLA Piper, his clients and his many friends and colleagues. I am proud to say that I have been friends with Browning for many years and count myself in his fan club. We go back to the early days of Catalyst and before that even. The time was too short.
Browning served on the Catalyst Advisory Board for the past two years and was always quick to help whenever I asked. You couldn’t ask for a better sounding board or friend.
Many have already posted their thoughts and regrets about the loss of Browning, including our friend Craig Ball, who as usual made the case as eloquently as possible (Browning Marean 1942-2014). Thanks to Chris Dale as well for his comments, Goodbye Old Friend: Farewell to Browning Marean, and photo gallery. And to Ralph Losey: Browning Marean: The Life and Death of a Great Lawyer. And Tom O’Connor: Browning Marean: A Remembrance.
Browning and I go back to the early days, before there was an “E” in front of discovery. He told me once that he got his start on the speaking circuit after hearing one of my talks. It inspired him to see a lawyer up there talking about litigation technology, he said. Having watched Browning leave me in the dirt with his speaking prowess, I was both honored and pleased to have played a small part in getting him going.
I had the privilege of being with Browning on the dais, at conferences and in quiet evening meals from Hong Kong to London and many places in between. Had I realized time was short, there are so many things I would have wanted to say. Alas, that seldom happens and it didn’t here. He wrote me a few weeks ago to say he expected to be back on his feet in September. How I wish that were still true.
Browning: You touched a lot of people over your too few years and made the world a better place. We carry on in your honor.
Rest in peace, old friend.
Last month, two of the leading experts on e-discovery, Maura R. Grossman and Gordon V. Cormack, presented a peer-reviewed study on continuous active learning to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM), “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.”
In the study, they compared three TAR protocols, testing them across eight different cases. Two of the three protocols, Simple Passive Learning (SPL) and Simple Active Learning (SAL), are typically associated with early approaches to predictive coding, which we call TAR 1.0. The third, continuous active learning (CAL), is a central part of a newer approach to predictive coding, which we call TAR 2.0.
Maura Grossman and Gordon Cormack just released another blockbuster article, “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’” 7 Federal Courts Law Review 286 (2014). The article was in part a response to an earlier article in the same journal by Karl Schieneman and Thomas Gricks, in which they asserted that Rule 26(g) imposes “unique obligations” on parties using TAR for document productions and suggested using techniques we associate with TAR 1.0 including:
This past weekend I received an advance copy of a new research paper prepared by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” They have posted an author’s copy here.
The study attempted to answer one of the more important questions surrounding TAR methodology:
I read with great interest a recent article in Law Technology News, “Four Examples of Predictive Coding Success,” by Barclay T. Blair.
The purpose of the article was to report on several successful uses of technology-assisted review. While that was interesting, my attention was drawn to another aspect of the report. Three of the case studies provided data shedding further light on that persistent e-discovery mystery: “How many documents in a gigabyte?”