[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]
We received this question:
What are the thresholds (in numbers of docs) at which your company will recommend the use of predictive coding? Would this be case dependent or just a percentage of documents (e.g. 100 out of 1,000 documents giving us 10%)?
Today’s question is answered by Dr. Jeremy Pickens, senior applied research scientist.
I have noticed that certain popular document-based systems in the e-discovery marketplace tout a particular feature. Although I am a research scientist at Catalyst, I have been on enough sales calls with my fellow Catalyst team members to have heard numerous users of those systems ask whether we can automatically remove common headers and footers from email. Because document-based systems showcase this capability as a feature that is good to have, clients often include it in their checklist of capabilities.
This leads me to ask: Why?
For the longest time, this request confused me. It is a capability that many declare they need simply because they have seen it offered elsewhere. That leads me to discuss holistic thinking about one’s technology assisted review (TAR) algorithms and processes.
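For readers who have not seen it, here is a minimal sketch of how such a feature is often implemented, assuming a simple frequency-based approach; it is a generic illustration, not Catalyst’s (or any particular vendor’s) actual method. Lines that recur across a large share of the emails in a collection are treated as boilerplate and stripped before the text is indexed or used for training:

```python
from collections import Counter

def find_boilerplate_lines(emails, min_fraction=0.4):
    """Identify lines (confidentiality footers, signature blocks, etc.)
    that appear in a large fraction of the emails in a collection."""
    line_counts = Counter()
    for body in emails:
        # Count each distinct line once per email so long emails don't dominate.
        for line in {l.strip() for l in body.splitlines() if l.strip()}:
            line_counts[line] += 1
    threshold = min_fraction * len(emails)
    return {line for line, count in line_counts.items() if count >= threshold}

def strip_boilerplate(body, boilerplate):
    """Drop the identified boilerplate lines from a single email body."""
    return "\n".join(l for l in body.splitlines() if l.strip() not in boilerplate)
```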
In a recent blog post, Ralph Losey lays out a case for the abolition of control sets in e-discovery, particularly if one is following a continuous learning protocol. Here at Catalyst, we could not agree more with this position. From the moment we rolled out our TAR 2.0 continuous learning engine, we have recommended against the use of control sets; indeed, we decided never to implement them in the first place, so we never even had the potential of steering clients awry.
Losey points out three main flaws with control sets, which may be summarized as (1) knowledge issues, (2) sequential testing bias, and (3) representativeness. In this blog post I offer my own take and evidence in favor of these three points, and add a fourth difficulty with control sets: rolling collection.
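To make the rolling-collection problem concrete, here is a small simulation with invented numbers (the batch sizes and richness values are hypothetical, not data from any real matter). A control set drawn from the first batch freezes an estimate of richness that no longer describes the collection once later, much leaner batches arrive:

```python
import random

random.seed(42)

def make_batch(size, richness):
    """Simulate a collection batch: 1 = relevant document, 0 = not relevant."""
    return [1 if random.random() < richness else 0 for _ in range(size)]

# Batch 1 arrives first; the control set is sampled from it alone.
batch1 = make_batch(10_000, richness=0.20)
control_set = random.sample(batch1, 500)

# Batches 2 and 3 arrive later with much lower richness (e.g., backup tapes).
batch2 = make_batch(15_000, richness=0.05)
batch3 = make_batch(15_000, richness=0.02)

full_collection = batch1 + batch2 + batch3

control_estimate = sum(control_set) / len(control_set)
true_richness = sum(full_collection) / len(full_collection)

print(f"Richness estimated from control set: {control_estimate:.1%}")
print(f"True richness of the full collection: {true_richness:.1%}")
# The control set, frozen at batch 1, overstates richness for the final
# collection, so any recall target computed against it inherits that bias.
```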
There has been a bit of talk lately in the e-discovery echo chamber about fixed-price models for processing, hosting, review and productions. The apparent goal of this discussion was to create a stir and drum up business. Yet conspicuously absent from the entire discussion was any talk of total cost, also known as value. I am a research scientist at Catalyst, so typically I do not get involved in discussions like this. However, as there still seems to be a great deal of confusion over value, I felt the need to help sort this out.
First, a bit of my background. I have spent the last 18 years of my professional life developing and applying algorithms to the task of finding relevant information. Currently, I am the senior applied research scientist at Catalyst. I obtained my Ph.D. in computer science, with a focus on information retrieval (search engines), from the Center for Intelligent Information Retrieval (CIIR) at UMass Amherst in 2004. I did a postdoc at King’s College London and then spent five years at the Fuji Xerox research lab in Palo Alto (FXPAL) before joining Catalyst in 2010.
Before joining Catalyst in 2010, my entire academic and professional career revolved around basic research. I spent my time coming up with new and interesting algorithms: ways of improving document ranking and classification. In much of my research, however, it was not always clear which algorithms might or might not have immediate application. It is not that the algorithms were not useful; they were. They just did not always have immediate application to a live, deployed system.
Since joining Catalyst, however, my research has become much more applied. I have come to discover that this does not just mean that the algorithms I design have to be more narrowly focused on the task at hand. It also means that I have to design those algorithms to be aware of the larger real-world contexts in which they will be deployed and the limitations that may exist there.
So it is with keen interest that I have been watching the e-discovery world react to the recent (SIGIR 2014) paper from Maura Grossman and Gordon Cormack on the continuous active learning (CAL) protocol, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.”
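For readers unfamiliar with the protocol, here is a bare-bones sketch of a CAL-style loop. The feature representation, classifier, batch size and stopping condition are my own simplifying assumptions (scikit-learn is used purely for illustration), not details prescribed by the paper, and the sketch assumes the seed set contains at least one relevant and one non-relevant coding:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(docs, review_fn, seed_labels, batch_size=100, max_reviewed=2000):
    """Continuous active learning: repeatedly train on everything coded so far,
    rank the uncoded documents, and send the top-ranked batch to review.

    docs        -- list of document texts
    review_fn   -- callable(doc_index) -> 0/1, i.e. the reviewer's coding
    seed_labels -- dict {doc_index: 0/1}; must contain both classes
    """
    vectorizer = TfidfVectorizer(max_features=50_000)
    X = vectorizer.fit_transform(docs)
    labels = dict(seed_labels)

    while len(labels) < max_reviewed:
        train_idx = list(labels)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], [labels[i] for i in train_idx])

        # Score every document not yet coded and take the highest-ranked batch.
        uncoded = [i for i in range(len(docs)) if i not in labels]
        if not uncoded:
            break
        scores = clf.predict_proba(X[uncoded])[:, 1]
        ranked = sorted(zip(uncoded, scores), key=lambda t: t[1], reverse=True)

        for i, _ in ranked[:batch_size]:
            labels[i] = review_fn(i)  # human codes the doc; it joins training

    return labels
```

The defining characteristic is that every newly coded document immediately joins the training set, and review effort is always directed at the documents the current model ranks highest.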
On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process, and it generated a lot of interest and a bit of dyspepsia too. At the very least, it got some good discussions going and perhaps nudged the status quo a bit.
One writer, Roe Frazer, took issue with our views in a blog post of his own. Apparently, he tried to post his comments with Law Technology News but was unsuccessful. Instead, he posted his reaction on the blog of his company, Cicayda. We would have responded there, but we don’t see a spot for replies on that blog either.
In a recent blog post, Ralph Losey tackles the issue of expertise and TAR algorithm training. The post, as is characteristic of Losey’s writing, is densely packed. He raises a number of objections to doing any sort of training with a reviewer who is not a subject matter expert (SME). I will not attempt to unpack every one of those objections. Rather, I wish to cut directly to the fundamental point underlying the belief that an SME, and only an SME, must provide the judgments (the document codings) used for training.
Why Predictive Ranking?
Most articles about technology assisted review (TAR) start with dire warnings about the explosion in electronic data. In most legal matters, however, the reality is that the quantity of data is big, but it is no explosion. The fact of the matter is that even half a million documents (a relatively small number compared to the “big data” of the web) pose a significant challenge to a review team. That is a lot of documents, and they can cost a lot of money to review, especially if you have to go through them in a manual, linear fashion. Catalyst’s Predictive Ranking bypasses that linearity, helping you zero in on the documents that matter most. But that is only part of what it does.
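As a toy illustration of what bypassing linearity means (the documents, relevance calls and scores below are invented for the example, and the scoring model is left abstract; this is not Catalyst’s actual algorithm), compare how quickly relevant documents surface when review follows a predicted ranking rather than load-file order:

```python
# Toy illustration: 10 documents, relevant = 1. Predicted scores are invented.
docs = [
    {"id": 1, "relevant": 0, "score": 0.12},
    {"id": 2, "relevant": 1, "score": 0.91},
    {"id": 3, "relevant": 0, "score": 0.08},
    {"id": 4, "relevant": 1, "score": 0.77},
    {"id": 5, "relevant": 0, "score": 0.33},
    {"id": 6, "relevant": 1, "score": 0.85},
    {"id": 7, "relevant": 0, "score": 0.05},
    {"id": 8, "relevant": 0, "score": 0.22},
    {"id": 9, "relevant": 1, "score": 0.68},
    {"id": 10, "relevant": 0, "score": 0.15},
]

def relevant_in_first(order, n):
    """Count relevant documents among the first n reviewed."""
    return sum(d["relevant"] for d in order[:n])

linear_order = docs                                           # load-file order
ranked_order = sorted(docs, key=lambda d: d["score"], reverse=True)

print("Relevant in first 4 reviewed (linear):", relevant_in_first(linear_order, 4))  # 2
print("Relevant in first 4 reviewed (ranked):", relevant_in_first(ranked_order, 4))  # 4
```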
In the real world of e-discovery search and review, the challenges lawyers face come not merely from the explosion of data, but also from the constraints imposed by rolling collection, immediate deadlines, and non-standardized (and at times confusing) …
[This is another in a series of search Q&As between Bruce Kiefer, Catalyst’s director of research and development, and Dr. Jeremy Pickens, Catalyst’s senior applied research scientist.]
BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?
DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.