Author Archives: Jeremy Pickens

The Five Myths of Technology Assisted Review, Revisited

On Jan. 24, Law Technology News published John’s article, “Five Myths about Technology Assisted Review.” The article challenged several conventional assumptions about the predictive coding process, generated a lot of interest and caused a bit of dyspepsia too. At the very least, it got some good discussions going and perhaps even nudged the status quo a bit.

One writer, Roe Frazer, took issue with our views. Apparently, he tried to post his comments with Law Technology News but was unsuccessful, so he instead posted his reaction on the blog of his company, Cicayda. We would have responded there, but we don’t see a spot for replies on that blog either. Continue reading

In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey)

In a recent blog post, Ralph Losey tackles the issue of expertise and TAR algorithm training. The post, as is characteristic of Losey’s writing, is densely packed. He raises a number of objections to doing any sort of training with a reviewer who is not a subject matter expert (SME). I will not attempt to unpack every one of those objections. Rather, I want to cut directly to the fundamental point underlying the belief that an SME, and only an SME, should provide the judgments (the document codings) that get used for training: Continue reading

Predictive Ranking: Technology Assisted Review Designed for the Real World

Why Predictive Ranking?

Most articles about technology assisted review (TAR) start with dire warnings about the explosion in electronic data. In most legal matters, however, the reality is that the quantity of data is large, but it is no explosion. Even a half million documents—a relatively small number in comparison to the “big data” of the web—pose a significant challenge to a review team. That is a lot of documents, and they can cost a lot of money to review, especially if you have to go through them in a manual, linear fashion. Catalyst’s Predictive Ranking bypasses that linearity, helping you zero in on the documents that matter most. But that is only part of what it does.

In the real world of e-discovery search and review, the challenges lawyers face come not merely from the explosion of data, but also from the constraints imposed by rolling collections, immediate deadlines and non-standardized (and at times confusing) validation procedures. Overcoming these challenges is as much about process and workflow as it is about technology specifically crafted to enable that workflow. For these real-world challenges, Catalyst’s Predictive Ranking provides solutions that no other TAR process can offer. Continue reading

In Search, Evaluation Drives Innovation; Or, What You Cannot Measure You Cannot Improve

[Photo: Information retrieval researchers at Shonan last week. That’s me in the center, wearing the yellow T-shirt.]

Last week, I was honored to join a small group of information-retrieval researchers from around the world, from both industry and academia, who gathered at the Shonan Village Center in Kanagawa, Japan, to discuss issues surrounding the evaluation of whole-session, interactive information retrieval. In this post, I introduce the purpose of the meeting. In later posts, I hope to review the discussions that took place at Shonan and offer my own impressions. Continue reading

Search Q&A: How ECA is ‘Broken’ and the Solution that will Fix It

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?

DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.

Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by removing 50% or more of the documents–before moving into active document search and review. This culling is typically carried out by filtering on metadata (such as date or author) or on keywords or concepts, and by removing documents that contain certain “obviously” non-relevant terms.

In theory, the idea is fantastic. It greatly reduces the cost of both hosting and review. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar: they maintain primary and secondary indexes. Content most likely to be relevant and useful to their users is fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary index is kept smaller and faster, making the overall search process much better.
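As a rough illustration of that two-tier idea, here is a minimal sketch in Python. The scoring heuristic, threshold and field names are hypothetical, chosen only to show the mechanics, not how any particular search engine actually works:

def crude_score(doc):
    # Stand-in for whatever heuristic separates "likely useful" from "likely junk"
    # (spam score, date range, custodian, etc.). A hypothetical keyword check here.
    return 1.0 if "contract" in doc["text"].lower() else 0.0

def build_indexes(collection, threshold=0.5):
    # Split the collection into two tiers instead of discarding anything.
    primary, secondary = [], []
    for doc in collection:
        (primary if crude_score(doc) >= threshold else secondary).append(doc)
    return primary, secondary

def search(query, primary, secondary, min_hits=10):
    # Search the small, fast primary tier first; fall back to the secondary
    # tier when too few hits are found.
    hits = [d for d in primary if query in d["text"].lower()]
    if len(hits) < min_hits:
        hits += [d for d in secondary if query in d["text"].lower()]
    return hits

The point of the sketch is the fallback: the secondary tier still exists and can be searched when the primary tier comes up short.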

However, there is a key difference between ECA and the web engine practice of primary and secondary indexing. In ECA, there is no secondary index. Documents that have been judged non-relevant solely on the basis of a few keywords, concepts or metadata fields are removed from the process completely, never to be revisited. Therein lies the problem. Continue reading

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).

[Image: Smoke Signals by Frederic Remington]

By itself, a signal does not necessarily make the search process better. Sure, there may be instances when a user wants to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. More often, though, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel that powers those improvements.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential. Continue reading
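To make those examples concrete, here is a minimal sketch of how such signals might be computed as features. The name lists and the word-position check are hypothetical placeholders, not anything drawn from an actual matter:

# Hypothetical name lists, for illustration only.
LAWYER_NAMES = {"jane counsel", "john barrister"}
PRODUCT_NAMES = {"project falcon"}

def extract_signals(text):
    words = text.split()
    lowered = text.lower()
    return {
        # Positional/case signal: is the 17th word capitalized?
        "word_17_capitalized": len(words) >= 17 and words[16][0].isupper(),
        # Name-based signals: possible privilege or confidentiality cues.
        "mentions_lawyer": any(name in lowered for name in LAWYER_NAMES),
        "mentions_product": any(name in lowered for name in PRODUCT_NAMES),
        # A simple document-level property.
        "doc_length": len(words),
    }

On their own these values decide nothing; they become useful when fed into algorithm training or relevance feedback, as described above.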

Search Q&A: The Six Blind Men and the E-Discovery Elephant

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of search algorithms out there. Why do you feel that collaboration is a better way to search?

DR. JEREMY PICKENS: Collaboration is a better way to search because e-discovery is not all about the algorithms; it is also about the people using them.

In a previous post (Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?), I talked about why there will never be a magic bullet for e-discovery. That is primarily because an information need in e-discovery is almost never satisfied by a single document, as it often is in web search. Rather, hundreds or thousands of responsive documents must be found. Continue reading

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need. Continue reading
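For readers unfamiliar with collaborative filtering, here is a bare-bones sketch of that single-user scenario. The users, items and ratings are invented purely for illustration:

# Past ratings from you and the people around you (hypothetical data).
ratings = {
    "you":     {"toaster_a": 5, "cd_x": 4},
    "friend1": {"toaster_a": 5, "toaster_b": 2, "restaurant_cancun": 5},
    "friend2": {"cd_x": 4, "cd_y": 5, "restaurant_cancun": 3},
}

def similarity(a, b):
    # Crude agreement measure: closer ratings on shared items -> higher similarity.
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(1.0 / (1.0 + abs(a[i] - b[i])) for i in shared) / len(shared)

def recommend(user, ratings):
    # Score items the user has not tried, weighted by how similar each rater is.
    me = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = similarity(me, theirs)
        for item, rating in theirs.items():
            if item not in me:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("you", ratings))  # e.g. ['restaurant_cancun', 'cd_y', 'toaster_b']

The contrast with collaborative information seeking is exactly the one drawn above: here the “team” is one person relying on the already satisfied needs of others, rather than a group working together toward a shared information need.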

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL). This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm). The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches. Continue reading

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

DR. JEREMY PICKENS: In machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for even a small subset of the entire collection, it can be used to build (infer) a model. That model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for smart prioritization of the manual review. Continue reading
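As a rough sketch of that supervised scenario (using scikit-learn, which is my own choice for illustration rather than anything named above), a small hand-coded subset trains a model that then scores the unreviewed remainder:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical ground truth: a tiny hand-coded subset of the collection.
labeled_docs = [
    "meeting to discuss the acquisition contract",
    "lunch plans for friday",
    "revised contract terms attached",
    "company picnic photos",
]
labels = [1, 0, 1, 0]  # 1 = responsive, 0 = non-responsive

# Unreviewed documents the inferred model will label or prioritize.
unlabeled_docs = [
    "draft contract for your review",
    "friday happy hour?",
]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(labeled_docs), labels)

# Probability of responsiveness for each unseen document. These scores can be
# accepted as labels outright or used to order the manual review queue.
scores = model.predict_proba(vectorizer.transform(unlabeled_docs))[:, 1]
review_order = sorted(zip(scores, unlabeled_docs), reverse=True)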