Whether at Sedona, Georgetown, Legaltech or any of the many other discovery conferences one might attend, a common debate centers on the efficacy of keyword search. “Keyword search is dead,” some argue, touting the effectiveness of the newer predictive analytics engines. “Long live keyword search,” comes the retort from lawyers who have relied on it for decades, both to find legal precedent and, more recently, to find relevant documents for their cases.
Often, the critics of keyword searching cite the 1985 Blair and Maron study for the Association for Computing Machinery, which suggested that full-text retrieval systems brought back only 20 percent of the relevant documents. That assertion is true, but I wonder how many of the debaters have ever read the study itself. My guess is not many, including me. So I decided to give it a read.
Recently, Bob Ambrogi, our director of communications, published a post called “Our 10 Most Popular Blog Posts of 2015 (So Far).” To my surprise, one of my 2011 posts topped the list: “Shedding Light on an E-Discovery Mystery: How Many Documents in a Gigabyte?” Another on the same topic ranked fourth: “How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question.”
Hmmm. Clearly, a lot of us are interested in knowing the answer to this question. I have received a number of comments on both posts, in writing and in conversation, which always makes the writing worthwhile. The RAND people told me they also found my figures of interest when they were putting together their study on e-discovery costs.
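For readers who want a feel for why the question is so hard to pin down, here is a minimal back-of-the-envelope sketch. The average file sizes below are illustrative assumptions for this example, not the measured figures from the original posts:

```python
# Back-of-the-envelope: documents per gigabyte for a few assumed
# average file sizes. These averages are illustrative only; real
# collections vary widely by file type and by case.
GIB = 1024 ** 3  # bytes in a gigabyte (binary)

assumed_avg_sizes_kb = {
    "email (no attachments)": 100,
    "Word document": 300,
    "PDF": 500,
}

for doc_type, kb in assumed_avg_sizes_kb.items():
    docs_per_gb = GIB / (kb * 1024)
    print(f"{doc_type}: ~{docs_per_gb:,.0f} documents per GB")
```

The spread alone, from roughly two thousand to ten thousand documents per gigabyte depending on file mix, is a big part of why the question keeps getting asked.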
A key debate in the battle between TAR 1.0 (one-time training) and TAR 2.0 (continuous active learning) is whether you need a “subject matter expert” (SME) to do the training. With first-generation TAR engines, this was considered a given. Training had to be done by an SME, which many interpreted as a senior lawyer intimately familiar with the underlying case. Indeed, the big question in the TAR 1.0 world was whether you could use several SMEs to spread the training load and get the work done more quickly.
SME training presented practical problems for TAR 1.0 users—primarily because the SME had to look at a lot of documents before review could begin. You started with a “control” set, often 500 documents or more, to be used as a reference for training. Then, the SME needed to review thousands of additional documents to train the system. After that, the SME had to review and tag another 500 documents to verify the effectiveness of the training. All told, the SME could expect to look at and judge 3,000 to 5,000 or more documents before the review could start.
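To put that burden in hours and dollars, here is a rough sketch. The review speed and billing rate are assumptions for illustration, not figures from any particular engagement:

```python
# Rough cost of TAR 1.0 training when a senior-lawyer SME must
# personally review every training document.
# Both rates below are illustrative assumptions.
docs_to_review = (3_000, 5_000)   # control set + training + validation
docs_per_hour = 60                # assumed SME review speed
hourly_rate = 400                 # assumed SME billing rate, USD

for n in docs_to_review:
    hours = n / docs_per_hour
    print(f"{n:,} docs: ~{hours:.0f} SME hours, ~${hours * hourly_rate:,.0f}")
# 3,000 docs: ~50 SME hours, ~$20,000
# 5,000 docs: ~83 SME hours, ~$33,333
```

Under even these modest assumptions, a week or two of senior-lawyer time evaporates before the first reviewer sees a document.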
I do not know if any leprechauns appeared in this case, but the Irish High Court found the proverbial pot of gold under the TAR rainbow in Irish Bank Resolution Corp. v. Quinn—the first decision outside the U.S. to approve the use of Technology Assisted Review for civil discovery.
The protocol at issue in the March 3, 2015, decision was TAR 1.0 (Clearwell). For that reason, some of the points addressed by the court will be immaterial for legal professionals who use the more-advanced TAR 2.0 and Continuous Active Learning (CAL). Even so, the case makes for an interesting read, both for its description of the TAR process at issue and for its ultimate outcome.
Since the advent of Technology Assisted Review (aka TAR, predictive coding or computer-assisted review), one of the open questions is whether you have to run a separate TAR process for each item in a document request. As litigation professionals know, it is rare to have only one numbered request in a Rule 34 pleading. Rather, you can expect to see scores of requests (typically as many as the local rules allow).
I have been on the road quite a bit lately, attending and speaking at several e-discovery events. Most recently I was at the midyear meeting of the Sedona Conference Working Group 1 in Dallas, and before that I was a speaker at both the University of Florida’s 3rd Annual Electronic Discovery Conference and the 4th Annual ASU-Arkfeld E-Discovery and Digital Evidence Conference.
In my travels and elsewhere, I continue to see a marked increase in talk about the new TAR 2.0 protocol, Continuous Active Learning (CAL). Interest has been building ever since the July 2014 release of the Grossman/Cormack study, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.”
Technology assisted review has a transparency problem. Notwithstanding TAR’s proven savings in both time and review costs, many attorneys hesitate to use it because courts require “transparency” in the TAR process.
Specifically, when courts approve requests to use TAR, they often set the condition that counsel disclose the TAR process they used and which documents they used for training. In some cases, the courts have gone so far as to allow opposing counsel to kibitz during the training process itself.
Our Summit partner, DSi, has a large financial institution client that had allegedly been defrauded by a borrower. The details aren’t important to this discussion, but assume the borrower employed a variety of creative accounting techniques to make its financial position look better than it really was. And, as is often the case, the problems were missed by the accounting and other financial professionals conducting due diligence. Indeed, there were strong factual suggestions that one or more of the professionals were in on the scam.
As the fraud came to light, litigation followed. Perhaps in retaliation, or simply to mount a counteroffensive, the defendant borrower hit the bank with lengthy document requests. After collection and best-efforts culling, our client was still left with over 2.1 million documents that might be responsive. Neither time deadlines nor budget allowed for manual review of that volume of documents. Keyword search offered some help, but the problem remained: what to do with 2.1 million potentially responsive documents?
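A quick bit of arithmetic shows why manual review was off the table. The review speed and blended rate below are illustrative assumptions, not figures from the actual matter:

```python
# Why eyes-on review of 2.1 million documents doesn't pencil out.
# Review speed and cost are illustrative assumptions, not case figures.
documents = 2_100_000
docs_per_hour = 55          # assumed contract-reviewer speed
cost_per_hour = 60          # assumed blended reviewer rate, USD

hours = documents / docs_per_hour
reviewer_years = hours / 2_000   # ~2,000 review hours per reviewer-year
print(f"~{hours:,.0f} review hours (~{reviewer_years:.0f} reviewer-years), "
      f"~${hours * cost_per_hour:,.0f}")
# ~38,182 review hours (~19 reviewer-years), ~$2,290,909
```

Even a large review team working in parallel would blow through any realistic deadline and budget at that scale.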
In July 2014, attorney Maura Grossman and professor Gordon Cormack introduced a new protocol for Technology Assisted Review that they showed could cut review time and costs substantially. Called Continuous Active Learning (“CAL”), this new approach differed from traditional TAR methods because it employed continuous learning throughout the review, rather than the one-time training used by most TAR technologies.
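To make the difference concrete, here is a minimal sketch of a CAL-style loop using scikit-learn. It illustrates the shape of the protocol, not Grossman and Cormack's implementation; `get_reviewer_judgment` is a hypothetical stand-in for a human reviewer's relevance call:

```python
# Minimal sketch of a Continuous Active Learning (CAL) loop:
# retrain after every reviewed batch, and always send the
# highest-ranked unreviewed documents to reviewers next.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(docs, seed_labels, get_reviewer_judgment, batch_size=100):
    # seed_labels: {doc index: 0/1}; must include at least one
    # relevant and one non-relevant example to train the first model.
    X = TfidfVectorizer().fit_transform(docs)
    labels = dict(seed_labels)
    while len(labels) < len(docs):
        # Retrain on everything reviewed so far (continuous learning).
        idx = list(labels)
        model = LogisticRegression(max_iter=1000)
        model.fit(X[idx], [labels[i] for i in idx])
        # Rank the unreviewed documents and review the likeliest-relevant.
        unreviewed = [i for i in range(len(docs)) if i not in labels]
        scores = model.predict_proba(X[unreviewed])[:, 1]
        batch = [unreviewed[i] for i in np.argsort(-scores)[:batch_size]]
        for i in batch:
            labels[i] = get_reviewer_judgment(docs[i])
        # In practice the loop stops once new batches yield few relevant
        # documents, rather than exhausting the whole collection.
    return labels
```

The key design point is inside the loop: every batch of human judgments feeds the next ranking, so the model keeps improving as the review proceeds instead of being frozen after one-time training.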
Their peer-reviewed research paper, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” also showed that using random documents was the least effective method for training a TAR system. Overall, they showed that CAL solved a number of real-world problems that had bedeviled review managers using TAR 1.0 protocols.
Not surprisingly, their research caused a stir. Some heralded its common-sense findings about continuous learning and the inefficiency of using random seeds for training. Others challenged the results, arguing that one-time training is good enough and that using random seeds eliminates bias. For our part, we were pleased that the study confirmed our earlier research and validated our approach, which we call TAR 2.0.
In Part One of this two-part post, I introduced readers to the statistical problems inherent in proving the level of recall reached in a Technology Assisted Review (TAR) project. Specifically, I showed that the confidence interval around an asserted recall percentage can be wide enough, at typical sample sizes, to undercut the very assertion used to justify your TAR cutoff.
In our hypothetical example, we had to acknowledge that while our point estimate suggested we had found 75% of the relevant documents in the collection, the true figure could be far lower. With a sample size of 600 documents, the lower bound of our confidence interval was 40%. If we increased the sample size to 2,400 documents, the lower bound rose only to 54%. And if we upped our sample to 9,500 documents, we got the lower bound to just 63%.
Even assuming that a 63% lower bound is good enough, we would have a lot of documents to sample. Using basic assumptions about cost and productivity, we concluded that we might spend 95 hours reviewing our sample, at a cost of about $20,000. If the sample didn’t prove out our hoped-for recall level (or if we received more documents to review), we might have to run the sample several times. That is a problem.
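For readers who want to replicate the arithmetic, here is a sketch using the exact (Clopper-Pearson) binomial interval from scipy. The 1.5% richness figure is an assumption chosen to roughly reproduce the numbers above; the original analysis may have used slightly different inputs:

```python
# Clopper-Pearson (exact) confidence interval for recall, estimated
# from the relevant documents that turn up in a random sample of the
# collection. The 1.5% richness below is an illustrative assumption.
from scipy.stats import beta

def recall_interval(found, relevant_in_sample, confidence=0.95):
    a = 1 - confidence
    k, n = found, relevant_in_sample
    lower = beta.ppf(a / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - a / 2, k + 1, n - k) if k < n else 1.0
    return k / n, lower, upper

richness = 0.015                       # assumed fraction of relevant docs
for sample_size in (600, 2_400, 9_500):
    relevant = round(sample_size * richness)
    found = round(relevant * 0.75)     # the review found ~75% of them
    point, lo, hi = recall_interval(found, relevant)
    print(f"n={sample_size:>5}: {found}/{relevant} found, "
          f"recall {point:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Whatever the exact inputs, the shape of the problem is the same: because only a small fraction of the sample is relevant, the effective sample size driving the interval is tiny, so the lower bound climbs painfully slowly as the sample (and the review bill) grows.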
Is there a better and cheaper way to prove recall in a statistically sound manner? In this Part Two, I will take a look at some of the other approaches people have put forward and see how they match up. However, as Maura Grossman and Gordon Cormack warned in “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review’” and Bill Dimm amplified in a later post on the subject, there is no free lunch.