[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]
BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?
DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.
Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by completely removing 50% of the documents or more–prior to going into active document searching and review. This practice is often carried out by using metadata (such as date or author), keywords or concepts, and removing documents that contain certain “obviously” non-relevant terms.
In theory, the idea is fantastic. It greatly reduces the cost of both hosting and of reviewing. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar; they have primary and secondary indexes. Content most likely to be relevant and useful to their users gets fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary indexes are made smaller and faster, making the overall search process much better.
However, there is a key difference between ECA and the web engine practice of primary and secondary indexing. In ECA, there is no secondary index. Documents that have been judged non-relevant on the sole basis of a few keywords or concepts or metadata are simply removed from the process completely, never to be revisited. Therein lies the problem.
I am an information-retrieval research scientist. One of the core precepts in my field is that a document will be relevant for only a few, very specific reasons, but non-relevant for dozens if not hundreds of reasons. The corollary to this is that there are many more keywords and concepts found in non-relevant documents that are also found in relevant documents than vice versa. That is, there is a higher probability that a keyword or concept found in a non-relevant document will also be found in a relevant document.
So what does that mean for ECA? The problem arises if you are using keywords and concepts to filter out non-relevant documents without actually assessing them for relevance (i.e. without actually doing review). In that case, there is a strong danger that the keywords and concepts you are using to do the filtering are also removing a number of relevant documents. And because you’re not doing what the web search engines do–creating a secondary index that can be revisited at a later point in time–but instead are completely removing those ECA’d documents from all further search and review, you’re losing those relevant documents forever.
When a Slam Dunk is a Smoking Gun
For example, one might be tempted to use ECA tools to filter out all documents that contain the terms “football,” “touchdown,” “49ers,” “Lakers,” “slam dunk,” “foul shot,” etc. Clearly these are all sports references and (let’s presume) sports emails are not relevant to the matter at hand but rather part of background office chatter. However, suppose the collection contains an email that says, “Cindy, I was able to reverse engineer competitor X’s code. I think this should make our new product offering a total slam dunk!” Or there might be another email that says, “Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers game after work on Monday? We can discuss our plans to fix the price of pork bellies.”
If the terms “49ers” and “slam dunk” have already been used during the ECA phase to completely remove every document that contains them, then these critical documents will be completely missed, putting the litigant at severe risk.
The solution, therefore, is to employ ECA in a manner that does not completely obliterate documents. Instead, ECA should be a tool for shifting certain sets of documents to a lower retrieval priority, a lower review priority or a secondary index. All of the documents should still be available. ECA simply helps with an intelligent prioritization of the searching and reviewing of those documents.
This approach allows the primary review to continue on as usual, with all the advantages of a pre-culled smaller number of documents. But if certain terms get discovered as part of that primary review process–terms such as “reverse engineer” or “pork bellies”–those terms can be used as queries into the secondary index. Then, the documents talking about meeting at the brew pub to watch the 49ers game and discuss the price fixing of pork bellies can still be recovered, despite having been pre-culled at an early stage. At the same time, if those ECA’d documents don’t contain “pork bellies,” they still remain in the secondary index and do not disrupt the efficiency and effectiveness of the primary index. It is the best of both worlds.
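To make the soft-boundary idea concrete, here is a minimal Python sketch of tiered indexing. It is an illustration only, not a description of any particular ECA product: the sample documents, the `CULL_TERMS` list and the `tier_documents` and `search` helpers are all hypothetical, and simple substring matching stands in for a real search engine.

```python
# Hypothetical cull terms, borrowed from the sports example in the text.
CULL_TERMS = ["football", "touchdown", "49ers", "lakers", "slam dunk", "foul shot"]

# Hypothetical four-document collection.
DOCS = {
    1: "Cindy, I was able to reverse engineer competitor X's code. "
       "I think this should make our new product offering a total slam dunk!",
    2: "Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers "
       "game after work on Monday? We can discuss our plans to fix the price "
       "of pork bellies.",
    3: "Did you see the Lakers last night? That foul shot was unreal.",
    4: "Attached is the quarterly pricing memo for the pork bellies contract.",
}


def contains(text, term):
    """Case-insensitive substring match (a stand-in for real search)."""
    return term.lower() in text.lower()


def tier_documents(docs, cull_terms):
    """Route documents that hit a cull term to a secondary index.

    Nothing is deleted: every document lands in exactly one tier.
    """
    primary, secondary = {}, {}
    for doc_id, text in docs.items():
        hit = any(contains(text, term) for term in cull_terms)
        (secondary if hit else primary)[doc_id] = text
    return primary, secondary


def search(index, term):
    """Return the sorted ids of documents in the index that contain the term."""
    return sorted(doc_id for doc_id, text in index.items() if contains(text, term))


primary, secondary = tier_documents(DOCS, CULL_TERMS)

# Terms surfaced during primary review become queries into the secondary index.
recovered_pork = search(secondary, "pork bellies")
recovered_reverse = search(secondary, "reverse engineer")
```

In this sketch the primary index holds only document 4, so primary review stays small. The two critical emails sit in the secondary index and come back as soon as "pork bellies" or "reverse engineer" is used as a query, while the pure sports chatter in document 3 stays out of the way.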
In short, the problem with ECA today is that it draws hard boundaries–it makes permanent decisions about documents when it really shouldn’t. The solution is to make those boundaries softer, to treat ECA as a prioritization tool, or as a mechanism for shifting documents into tiered secondary and even tertiary indexes. In that manner, poor decisions made early on in the process, under the blindness of an ECA process, are not made permanent. They can be easily, automatically and effectively corrected.
By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.
When there is that much information, it can be quite beneficial to have more than one person’s viewpoint. Every query is a different hypothesis about what is relevant, a different probe into the collection. More people working together means more viewpoints, which translate into a wider variety of probes.
Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, e.g. “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchor text, that web page will be boosted in the rankings. That web page will be “voted” to the top.


