E-Discovery Search Blog

Jeremy Pickens

About Jeremy Pickens

Jeremy Pickens, Ph.D., is one of the world's leading search scientists and a pioneer in the field of collaborative exploratory search, a form of search in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has six patents pending in the field of search and information retrieval, including two for collaborative exploratory search systems.

At Catalyst, Dr. Pickens researches and develops methods of using collaborative search to achieve more intelligent and precise results in e-discovery search and review. He also studies other ways to enhance search and review within the Catalyst system.

Dr. Pickens earned his master's and doctoral degrees at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King's College, London, on a joint grant with Goldsmiths University of London. As part of the OMRAS project (Online Music Recognition and Searching), he helped organize the first Music Information Retrieval (ISMIR) conference in Plymouth, Mass. Before joining Catalyst, Dr. Pickens spent five years as a research scientist at FX Palo Alto Lab, where his major research themes included video search and collaborative exploratory search.

Dr. Pickens is co-author of the forthcoming book, A Taxonomy of Collaborative Information Seeking, to be published by Morgan & Claypool Publishers. He was an editor of the spring 2010 special issue on collaborative information seeking of the journal Information Processing and Management. He is a frequent author and speaker on the topic.

Search Q&A: How ECA is ‘Broken’ and the Solution that will Fix It

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?

DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.

Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by completely removing 50% of the documents or more–prior to going into active document searching and review. This practice is often carried out by using metadata (such as date or author), keywords or concepts, and removing documents that contain certain “obviously” non-relevant terms.

In theory, the idea is fantastic. It greatly reduces the cost of both hosting and of reviewing. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar; they have primary and secondary indexes. Content most likely to be relevant and useful to their users gets fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary indexes are made smaller and faster, making the overall search process much better.

However, there is a key difference between ECA and the web engine practice of primary and secondary indexing.  In ECA, there is no secondary index. Documents that have been judged non-relevant on the sole basis of a few keywords or concepts or metadata are simply removed from the process completely, never to be revisited. Therein lies the problem.

I am an information-retrieval research scientist. One of the core precepts in my field is that a document will be relevant for only a few, very specific reasons, but non-relevant for dozens if not hundreds of reasons. The corollary to this is that there are many more keywords and concepts found in non-relevant documents that are also found in relevant documents than vice versa. That is, there is a higher probability that a keyword or concept found in a non-relevant document will also be found in a relevant document.

So what does that mean for ECA? The problem arises if you are using keywords and concepts to filter out non-relevant documents without actually assessing them for relevance (i.e. without actually doing review). In that case, there is a strong danger that the keywords and concepts you are using to do the filtering are also removing a number of relevant documents. And because you’re not doing what the web search engines do–creating a secondary index that can be revisited at a later point in time–but instead are completely removing those ECA’d documents from all further search and review, you’re losing those relevant documents forever.

When a Slam Dunk is a Smoking Gun

For example, one might be tempted to use ECA tools to filter out all documents that contain the terms “football,” “touchdown,” “49ers,” “Lakers,” “slam dunk,” “foul shot,” etc. Clearly these are all sports references and (let’s presume) sports emails are not relevant to the matter at hand but rather part of background office chatter. However, suppose the collection contains an email that says, “Cindy, I was able to reverse engineer competitor X’s code. I think this should make our new product offering a total slam dunk!” Or there might be another email that says, “Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers game after work on Monday? We can discuss our plans to fix the price of pork bellies.”

If the terms “49ers” and “slam dunk” have already been used during the ECA phase to completely remove every document that contains them, then these critical documents will be completely missed, putting the litigant at severe risk.

The solution, therefore, is to employ ECA in a manner that does not completely obliterate documents. Instead, ECA should be a tool for shifting certain sets of documents to a lower retrieval priority, a lower review priority or a secondary index. All of the documents should still be available. ECA simply helps with an intelligent prioritization of the searching and reviewing of those documents.

This approach allows the primary review to continue on as usual, with all the advantages of a pre-culled smaller number of documents. But if certain terms get discovered as part of that primary review process–terms such as “reverse engineer” or “pork bellies”–those terms can be used as queries into the secondary index. Then, the documents talking about meeting at the brew pub to watch the 49ers game and discuss the price fixing of pork bellies can still be recovered, despite having been pre-culled at an early stage. At the same time, if those ECA’d documents don’t contain “pork bellies,” they still remain in the secondary index and do not disrupt the efficiency and effectiveness of the primary index. It is the best of both worlds.
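
As a rough illustration of the tiered approach described above, here is a minimal Python sketch. The `TieredIndex` class, the `CULL_TERMS` list and the sample documents are all hypothetical, and plain substring matching stands in for a real search index. The point is only that culled documents are demoted to a secondary tier instead of being deleted, so a term discovered later can still reach them.

```python
from dataclasses import dataclass, field


@dataclass
class TieredIndex:
    """Two-tier document store: culled documents are demoted, never deleted."""
    primary: dict = field(default_factory=dict)    # doc_id -> text
    secondary: dict = field(default_factory=dict)  # doc_id -> text

    def add(self, doc_id, text, cull_terms):
        # Route to the secondary tier if any cull term appears (case-insensitive).
        lowered = text.lower()
        culled = any(term.lower() in lowered for term in cull_terms)
        (self.secondary if culled else self.primary)[doc_id] = text

    def search(self, term, include_secondary=False):
        # Substring match for illustration; a real system would use an inverted index.
        tiers = [self.primary] + ([self.secondary] if include_secondary else [])
        needle = term.lower()
        return sorted(doc_id for tier in tiers
                      for doc_id, text in tier.items() if needle in text.lower())


# Hypothetical cull list and documents, loosely based on the examples in the post.
CULL_TERMS = ["49ers", "slam dunk", "touchdown"]

index = TieredIndex()
index.add("d1", "Want to catch the 49ers game? We can discuss our plans "
                "to fix the price of pork bellies.", CULL_TERMS)
index.add("d2", "Quarterly report on pork bellies futures attached.", CULL_TERMS)
index.add("d3", "Great touchdown last night!", CULL_TERMS)

primary_hits = index.search("pork bellies")                      # primary tier only
all_hits = index.search("pork bellies", include_secondary=True)  # both tiers
```

Here the primary search finds only `d2`, while the search that also consults the secondary tier recovers `d1`, the document that a hard cull on “49ers” would have lost.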

In short, the problem with ECA today is that it draws hard boundaries–it makes permanent decisions about documents when it really shouldn’t. The solution is to make those boundaries softer, to treat ECA as a prioritization tool, or as a mechanism for shifting documents into tiered secondary and even tertiary indexes. In that manner, poor decisions made early on in the process, under the blindness of an ECA process, are not made permanent. They can be easily, automatically and effectively corrected.

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).

By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.

But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
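
To make the 17th-word example concrete, here is a small Python sketch. The function names and the toy documents are made up for illustration. It measures how often a signal fires among responsive and non-responsive documents, which is the kind of comparison that would tell you whether the signal is worth amplifying.

```python
def word_17_capitalized(text):
    """Signal: True if the document has a 17th word and it starts with a capital."""
    words = text.split()
    return len(words) >= 17 and words[16][:1].isupper()


def signal_rate(docs, signal):
    """Fraction of documents in `docs` for which the signal fires."""
    return sum(1 for d in docs if signal(d)) / len(docs) if docs else 0.0


# Toy data mirroring the numbers in the text: 30 of 35 responsive documents
# carry the signal, and none of the non-responsive ones do.
filler = " ".join(["word"] * 16)
responsive = [filler + " Acme merger"] * 30 + [filler + " lunch plans"] * 5
non_responsive = [filler + " lunch plans"] * 10

responsive_rate = signal_rate(responsive, word_17_capitalized)
non_responsive_rate = signal_rate(non_responsive, word_17_capitalized)
```

A large gap between the two rates suggests the signal discriminates well on the documents judged so far. How much weight to give it in a ranking algorithm is a separate modeling decision.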

Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.

Search Q&A: The Six Blind Men and the E-Discovery Elephant

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of search algorithms out there. Why do you feel that collaboration is a better way to search?

DR. JEREMY PICKENS: Collaboration is a better way to search because e-discovery is not all about the algorithms. Algorithms also involve people.

In a previous post (Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?), I talked about why there will never be a magic bullet for e-discovery. That primarily has to do with the fact that an information need is typically never satisfied with just a single document, as it often is in web search. Rather, in e-discovery, hundreds and thousands of responsive documents must be found.

When there is that much information, it can be quite beneficial to have more than one person’s viewpoint. Every query is a different hypothesis about what is relevant, a different probe into the collection. More people working together means more viewpoints, which translate into a wider variety of probes.

An algorithm that is multi-searcher aware and tries to reconcile (look for both similarities among and gaps between) the various searcher activities is going to do a better job than an algorithm that only comes at the problem from one viewpoint.

Think of it with reference to that old story of the six blind men who wanted to know what an elephant looked like. The first man touched the elephant’s leg and declared, “The elephant is a pillar.” The second touched its tail and described it as like a rope. The third felt the trunk and said it was like a tree branch. The fourth felt the ear and thought it was like a big fan. The fifth touched the belly and asserted it was a thick wall. The sixth felt the tusk and contended the elephant was like a solid pipe.

Seeing that the blind men could not agree on what the elephant looked like, a passing wise man explained, “All of you are right. The reason every one of you is telling it differently is because each one of you touched a different part of the elephant. Actually, the elephant has all the features each of you found.”

In a sense, e-discovery search is like those blind men’s search of an elephant. Provided the searchers work collaboratively, then as each searcher touches and interprets a part, eventually the whole elephant emerges. In search, therein lies the benefit of collaboration.

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.
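
One simple way to picture “jointly and simultaneously” is to score each listing by its least satisfied searcher. The Python sketch below does that. The `min` aggregation, the feature names and the listings are assumptions chosen for illustration, not a description of any particular collaborative search system.

```python
def satisfaction(listing_features, wanted):
    """Fraction of one searcher's wanted features that a listing offers."""
    return len(wanted & listing_features) / len(wanted)


def joint_rank(listings, criteria_by_searcher):
    """Rank listings by the least satisfied searcher, so every need counts."""
    def joint_score(features):
        return min(satisfaction(features, wanted)
                   for wanted in criteria_by_searcher.values())
    return sorted(listings, key=lambda name: joint_score(listings[name]),
                  reverse=True)


# Hypothetical listings and criteria based on the example in the text.
listings = {
    "A": {"south exposure", "upstairs kitchen"},
    "B": {"garden", "workshop"},
    "C": {"south exposure", "upstairs kitchen", "garden", "workshop"},
    "D": {"south exposure", "garden"},
}
criteria = {
    "you": {"south exposure", "upstairs kitchen"},
    "partner": {"garden", "workshop"},
}

ranking = joint_rank(listings, criteria)
```

Listing C, which satisfies both partners fully, ranks first. Listing D, which half-satisfies each, ranks ahead of A and B, each of which fully satisfies one partner and ignores the other.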

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.

The Recommind Patent and the Need to Better Define ‘Predictive Coding’

Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL). This workshop brought together a diverse array of lawyers, vendors and academics–and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm). The purpose of the workshop was, in part:

…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.

Hot on the list of topics, of course, was predictive coding.  Much of the discussion centered around determining exactly what standards were needed not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology may be safely applied to a matter at hand while still meeting all the legal requirements of discovery.

So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself.  Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process?  Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”

So I read the patent, and I read Recommind’s explanation, and I read the commentary, including Barry Murphy’s post, Dawn of the Predictive Coding Wars. First, from Murphy’s commentary:

According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on coattails.

Second, from the explanation:

Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …

The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan.  But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part.  Although technology is at its CORE, Predictive Coding includes all of these parts as one integrated whole.

OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology, so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow and documentation really ever explained. Great pain was taken to say what predictive coding was not (e.g. threading, clustering, etc. – which I agree with).   But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.

For that, I had to turn to the patent itself. See Figure 5 in the patent (above), labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim).  That claim says that the patent covers a method for analyzing a plurality of documents, comprising:

(1) Receiving the plurality of documents via a computing device

(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]

(3) Executing instructions stored in memory, that:

(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset

(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category

(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category

(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation

(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents

(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis

(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and

(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents

So that appears to be the primary workflow, the primary patented claim.  Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.
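
For readers who want to see one round of this loop in code, here is a minimal Python sketch using the classic Rocchio update, a standard relevance feedback algorithm. The passage quoted above does not name a specific algorithm, so Rocchio is swapped in here as a representative. The bag-of-words vectors, the sample documents and the weights (alpha=1.0, beta=0.75, gamma=0.15) are illustrative assumptions.

```python
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average term weight across a list of bag-of-words vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {t: c / len(vectors) for t, c in total.items()} if vectors else {}


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from nonrelevant ones."""
    rel_c, non_c = centroid(relevant), centroid(nonrelevant)
    updated = {}
    for term in set(query) | set(rel_c) | set(non_c):
        weight = (alpha * query.get(term, 0) + beta * rel_c.get(term, 0)
                  - gamma * non_c.get(term, 0))
        if weight > 0:  # negative weights are dropped, as is common practice
            updated[term] = weight
    return updated


def score(query, doc):
    """Dot product between a weighted query and a document vector."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())


query = vectorize("price")
marked_relevant = [vectorize("fix the price of pork bellies")]
marked_nonrelevant = [vectorize("price list for office chairs")]
unseen = vectorize("pork bellies meeting monday")

score_before = score(query, unseen)
revised_query = rocchio(query, marked_relevant, marked_nonrelevant)
score_after = score(revised_query, unseen)
```

The unseen document does not contain the original query term, so it scores zero at first. After one round of feedback the revised query has picked up “pork” and “bellies” from the document marked relevant, and the unseen document now scores above zero.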

In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does. It starts with a collection of documents. It selects a subset of those documents in some manner. It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself so as to better align itself with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevant or nonrelevant, responsive or nonresponsive) for all the results. Finally, it should be noted that the traditional, decades-old relevance feedback process workflow also is capable of iteration.

So what is the difference? I don’t just ask this rhetorically. I see a very strong similarity in the overall workflows between both predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be–and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies, a goal in which I fully agree with Recommind–then we really need to understand the process as a whole and what makes it unique.

The only thing I can think of is that there are a few occasions in the claimed predictive coding workflow that integrate random sampling, and this is most likely to ensure that the process is defensible. If that is the case, then how does that differ from active learning? Here is an example of the active learning workflow which incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:

Input:
    seed labelled data L, unlabelled data U, batch size b

Initialization:
    Train a model on L

Active Learning Loop:
    Until a stopping criterion is satisfied:
        Apply the trained model classifier on U
        Rank the instances in U using the uncertainty of the model
        Annotate the top b instances and add them to L
        Train the model on the expanded L
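
The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method from the paper or from any vendor: the “model” is a one-dimensional threshold midway between the class means, the `oracle` function stands in for the human annotator, and the stopping criterion is simply a fixed number of rounds.

```python
import math


def train(labelled):
    """Toy 1-D model: decision threshold midway between the two class means.

    Assumes the labelled data contains at least one example of each class.
    """
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def prob_positive(threshold, x):
    """Logistic estimate of P(label = 1) from the distance to the threshold."""
    return 1 / (1 + math.exp(-(x - threshold)))


def active_learning(labelled, unlabelled, oracle, batch_size, max_rounds):
    labelled, unlabelled = list(labelled), list(unlabelled)
    model = train(labelled)
    for _ in range(max_rounds):  # stopping criterion: fixed budget (assumption)
        if not unlabelled:
            break
        # Rank by uncertainty: probabilities closest to 0.5 come first.
        unlabelled.sort(key=lambda x: abs(prob_positive(model, x) - 0.5))
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        labelled += [(x, oracle(x)) for x in batch]  # the annotation step
        model = train(labelled)
    return model, labelled


def oracle(x):
    """Hypothetical annotator; the true boundary is at x = 6."""
    return 1 if x >= 6 else 0


seed = [(0.0, 0), (10.0, 1)]
pool = [1.0, 2.5, 4.0, 5.5, 6.5, 8.0, 9.0]
model, labelled_out = active_learning(seed, pool, oracle, batch_size=2, max_rounds=2)
```

In this toy run the loop asks for labels on the points nearest the current threshold, and the threshold moves from 5.0 toward the true boundary at 6 after only four annotations.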

That is, instead of just presenting the expert user (e.g. lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue they’ve been coded for, an active learning process or workflow explicitly seeks to add those document instances about which the learning algorithm is the most uncertain. That could mean documents for which the probability of that document’s label is relatively even or undistinguished (highest entropy) across all classes (in the case of generative machine learning models) or documents which lie the nearest to a decision boundary (in the case of discriminative machine learning models).

However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way–nay even the only way–of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of.  Thus, active learning as a general workflow pattern also incorporates random sampling.
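
The two ideas in the preceding paragraphs, entropy-based uncertainty and a random sample to catch what the model cannot yet see, can be combined in a short Python sketch. The function names, document IDs and probabilities are hypothetical, and the split between uncertain and random picks is an arbitrary choice for illustration.

```python
import math
import random


def entropy(probs):
    """Shannon entropy, in bits, of a class-probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_review(class_probs, collection, n_uncertain, n_random, rng):
    """Pick the highest-entropy scored documents, then add a random sample.

    `class_probs` maps document IDs to model probabilities; documents the
    model cannot score yet are reachable only through the random sample.
    """
    by_uncertainty = sorted(class_probs, key=lambda d: entropy(class_probs[d]),
                            reverse=True)
    chosen = by_uncertainty[:n_uncertain]
    remaining = [d for d in collection if d not in chosen]
    return chosen + rng.sample(remaining, min(n_random, len(remaining)))


# Hypothetical scores: d1 is a coin flip, d3 is nearly certain, d4-d6 are unscored.
class_probs = {"d1": [0.5, 0.5], "d2": [0.9, 0.1], "d3": [0.99, 0.01]}
collection = ["d1", "d2", "d3", "d4", "d5", "d6"]

picks = select_for_review(class_probs, collection, n_uncertain=1, n_random=2,
                          rng=random.Random(0))
```

The first pick is always `d1`, the document with the most evenly split probabilities. The other two are drawn at random from the rest of the collection, including documents the model has no estimate for.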

So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are.  That isn’t to say that they don’t exist.  However, I believe further discussion is warranted, both in public as well as at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole.  That is, I agree with Barry Murphy over at eDiscovery Journal that:

No matter what, this is good news for the eDiscovery market as a whole.  One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.

I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

DR. JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.
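
Even very simple counting can answer some of these questions without any labels. The Python sketch below tallies who emailed whom and which terms occur most often. The email records, field names and stopword list are invented for illustration, and real unsupervised methods such as clustering or matrix factorization go well beyond this.

```python
from collections import Counter


def communication_pairs(emails):
    """Count how often each (sender, recipient) pair occurs."""
    return Counter((e["from"], e["to"]) for e in emails)


def frequent_terms(emails, stopwords, n):
    """Return the n most frequent non-stopword terms across all email bodies."""
    counts = Counter(word for e in emails
                     for word in e["body"].lower().split()
                     if word not in stopwords)
    return counts.most_common(n)


# Hypothetical email records.
emails = [
    {"from": "cindy", "to": "jim", "body": "pork bellies price plan"},
    {"from": "jim", "to": "cindy", "body": "pork bellies meeting monday"},
    {"from": "cindy", "to": "jim", "body": "price of pork bellies"},
    {"from": "bob", "to": "cindy", "body": "lunch on monday"},
]

top_pair = communication_pairs(emails).most_common(1)[0]
top_terms = frequent_terms(emails, stopwords={"of", "on"}, n=2)
```

No one told the code that “pork bellies” matters. The pattern surfaces from the collection itself and can then suggest queries to a searcher who did not know to ask.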

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.
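
One common way to make that choice is uncertainty sampling: ask the reviewer to label the document the current model is least sure about. The sketch below assumes this strategy for illustration only; the `predict` and `ask_reviewer` callables and the scores are hypothetical stand-ins, not part of any particular platform.

```python
def most_uncertain(probabilities):
    """Return the document whose predicted probability is closest to 0.5."""
    return min(probabilities, key=lambda doc: abs(probabilities[doc] - 0.5))

def active_learning(unlabeled, predict, ask_reviewer, rounds):
    """Iteratively pick a document, get a human label, and repeat.

    predict(doc, labeled) -> probability that doc is responsive, given
    the labels collected so far (a stand-in for retraining a model).
    ask_reviewer(doc) -> the label supplied by the human expert.
    """
    labeled = {}
    pool = set(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        probabilities = {doc: predict(doc, labeled) for doc in sorted(pool)}
        choice = most_uncertain(probabilities)
        labeled[choice] = ask_reviewer(choice)
        pool.remove(choice)
    return labeled

# Hypothetical fixed scores standing in for a real model's output.
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10}
result = active_learning(
    scores,
    predict=lambda doc, labeled: scores[doc],
    ask_reviewer=lambda doc: "responsive",
    rounds=1,
)
```

Here the reviewer is shown `doc_b`, not the document most likely to be responsive (`doc_a`) or least likely (`doc_c`), because a label on the borderline document is assumed to teach the model the most.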

Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Information retrieval is a discipline from the 1970s. Relational databases arrived in the 1960s. Most e-discovery platforms combine full text search (from information retrieval) and a relational database. What do you think is new and exciting in the world of e-discovery with tools that are 40 and 50 years old? Do you think there is a magic algorithm that will be used in e-discovery that will be as disruptive as Google PageRank was for broad Internet searching?

JEREMY PICKENS: There are a number of different angles from which one could approach this question. Recall from a previous blog post that one of the primary distinguishing factors between web search and e-discovery search is that the former is geared toward finding the one best answer, such as a factoid or a home page (precision-oriented), whereas the latter typically requires thousands if not millions of relevant (responsive) documents in order to satisfy an information need. This difference is not insignificant; it changes the entire nature of the search system being designed to meet that need.

Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, as expressed by “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchor text, that web page will be boosted in the rankings. That web page will be “voted” to the top.

More to the point: The specific algorithm that is used to count those votes is not as important as simply having the votes in the first place. Having the votes is what moves your page from page 57 of the results to page 1. A better algorithm might move the page to rank #2 on page 1, rather than rank #9 on page 1. But 90% of what got that document to page 1 was the votes themselves, rather than the mathematics of how the votes were counted. And simply being on page 1 accounts for 90% of the success of PageRank, as typical web searchers will only look at the first page of results and almost never further.

In summary, it is not so much the PageRank algorithm (mathematics) that makes PageRank so successful. It is the signal (link “votes”) used as input to the algorithm; the signal correlates well with the ultimate user goal.
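
To make the distinction between signal and mathematics concrete, the sketch below ranks the pages of a tiny, invented link graph two ways: by a naive count of incoming links, and by a simplified textbook power-iteration version of PageRank. The graph, the damping factor and the function names are assumptions for the example, and this is not a description of Google's production system.

```python
def inlink_counts(links):
    """Naive vote counting: number of pages linking to each page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts

def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over an adjacency dict."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for page in pages:
                    new_rank[page] += damping * rank[source] / n
        rank = new_rank
    return rank

# Hypothetical web: pages a, b and c all "vote" for d; d links back to a.
links = {"a": ["d"], "b": ["d"], "c": ["d"], "d": ["a"]}

counts = inlink_counts(links)
ranks = pagerank(links)
top_by_count = max(counts, key=counts.get)
top_by_pagerank = max(ranks, key=ranks.get)
```

On this toy graph both methods put the same page first. The more careful mathematics changes the scores, but it is the link votes themselves that carry the page to the top.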

So the question is whether there will ever be a magic algorithm for e-discovery that will be as disruptive as PageRank. This is the same as asking whether there will ever be a single signal (such as a link “vote”) that correlates well with the user goal or intention. At the risk of making too bold of a claim, I think that the answer is no.

An e-discovery searcher’s information need simply does not fit the “magic bullet” profile. Someone engaged in e-discovery does not look at the first page of results and stop. That person (or a team of reviewers) may look at 20 pages. Or 100 pages. So whether one of the many available relevant documents is on page 1 or on page 57 matters much less. The user information need does not match what PageRank — or PageRank-like magic bullet algorithms — is trying to do.

Magic bullet algorithms try to get the absolute single best result (or a small handful of results) to the very top of the list. E-discovery users need thousands or millions of relevant results. And when there is that much information, it takes a huge diversity of signals, and coordination among dozens of different algorithms, to exhaustively find everything.

Please note, however, that this does not mean algorithmic approaches will not work for e-discovery. Quite the contrary; e-discovery is in need of more, better and smarter algorithms. And these algorithms will improve our ability and capacity to meet the e-discovery challenge. It is just that the algorithms developed will not be “magic bullet” algorithms. They will be like a well-coordinated orchestra, with dozens of components playing together in unison.

(Image: Felipe Micaroni Lalli per Creative Commons.)

Search Q&A: How to Evaluate the Quality of an E-Discovery Search Platform

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Since e-discovery is already costly and time consuming, there doesn’t seem to be a good way for customers to compare offerings by running a case in different systems. Besides sales slicks, acronyms and generic testing such as TREC, how do you think customers should evaluate the quality of the platform they have chosen to handle e-discovery?

JEREMY PICKENS: This is a good question, and one to which there is no single, easy answer. That said, one possibility would be for the platform itself to give you internal metrics in the form of goal-oriented progress prediction. For example, if your goal is to find all responsive or privileged documents in a collection, a good platform should not only give you an estimate of how many more responsive documents it thinks are available to be found, but also let you track the history of that prediction before and after various events.

One should be able to get a sense of how right or wrong that prediction was, as one’s session-based information-seeking task progresses.

Specifically, if that estimate changes drastically after the execution of a new query, or after the responsiveness coding of a particular set of documents, that should be brought to the user’s attention. In a quality platform, it is less important that the platform get the prediction right at the very beginning than it is that the platform be forthcoming and transparent about its mistakes. This should allow the user to work concertedly with the platform toward the goal.
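
A minimal sketch of such a prediction history, assuming the platform can already produce an estimate of the remaining responsive documents (how it does so is outside this example), could look like the following. The class name, the 50 percent alert threshold and the numbers are all hypothetical.

```python
class ProgressTracker:
    """Keep a history of 'responsive documents remaining' estimates."""

    def __init__(self, alert_ratio=0.5):
        # Relative change that counts as "drastic" (an assumed threshold).
        self.alert_ratio = alert_ratio
        self.history = []  # list of (event, estimate) tuples

    def record(self, event, estimate):
        """Store a new estimate; return an alert message if it jumped."""
        alert = None
        if self.history:
            previous = self.history[-1][1]
            if previous > 0:
                change = abs(estimate - previous) / previous
                if change > self.alert_ratio:
                    alert = (
                        f"Estimate moved from {previous} to {estimate} "
                        f"after: {event}"
                    )
        self.history.append((event, estimate))
        return alert

tracker = ProgressTracker()
first = tracker.record("initial random sample", 1000)
second = tracker.record("coded review batch 1", 950)
third = tracker.record("new query on pricing terms", 2400)
```

The first two events pass quietly, while the third returns an alert because the estimate more than doubled. The full `history` list remains available so the user can see how right or wrong earlier predictions turned out to be.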

Note: Jeremy Pickens, Bruce Kiefer and John Tredennick have written a research paper that expands on this topic, Process Evaluation in eDiscovery as Awareness of Alternatives. Pickens will present the paper June 6 at the ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings (DESI IV).