Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices
Bruce Kiefer

About Bruce Kiefer

Bruce Kiefer directs Catalyst's Research and Development Group, helping to develop the next generation of our technology, and is vice president of our Hosting Applications Division. He has worked in IT for many years, helping to build, deploy, manage, scale and repair networks and systems that solve problems.

Before joining Catalyst, he was vice president of operations for Viawest Internet Services. During Bruce's tenure at Viawest, he built many of the internal tools, grew the network to four states, and took over product management for Viawest's managed hosting offering.

In addition to his IT expertise, Bruce has a master's degree in business administration. He joined Catalyst in 2005, where he combines his knowledge of technology and business to help drive product development and build out operations.

Search Q&A: How ECA is ‘Broken’ and the Solution that will Fix It

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Early case assessment is a hot topic in electronic discovery. You believe that it may be flawed and cause additional errors. Why is that?

DR. JEREMY PICKENS: We’ve all heard the expression, “Don’t throw out the baby with the bath water.” Unfortunately, many e-discovery professionals risk doing exactly that in the way they are conducting ECA.

Let’s be more specific: By ECA, I am referring to the practice of culling down a collection of unstructured documents–often by completely removing 50% of the documents or more–prior to going into active document searching and review. This practice is often carried out by using metadata (such as date or author), keywords or concepts, and removing documents that contain certain “obviously” non-relevant terms.

In theory, the idea is fantastic. It greatly reduces the cost of both hosting and of reviewing. Why search or review documents that are obviously non-relevant? Why not cut out as much as possible beforehand, so as to make the manual, labor-intensive stage as easy as possible? Web search engines do something similar; they have primary and secondary indexes. Content most likely to be relevant and useful to their users gets fed into the primary index. Content that is less relevant, or that looks like spam, remains in the secondary index. In this manner, the primary indexes are made smaller and faster, making the overall search process much better.

However, there is a key difference between ECA and the web engine practice of primary and secondary indexing.  In ECA, there is no secondary index. Documents that have been judged non-relevant on the sole basis of a few keywords or concepts or metadata are simply removed from the process completely, never to be revisited. Therein lies the problem.

I am an information-retrieval research scientist. One of the core precepts in my field is that a document will be relevant for only a few, very specific reasons, but non-relevant for dozens if not hundreds of reasons. The corollary to this is that there are many more keywords and concepts found in non-relevant documents that are also found in relevant documents than vice versa. That is, there is a higher probability that a keyword or concept found in a non-relevant document will also be found in a relevant document.

So what does that mean for ECA? The problem arises if you are using keywords and concepts to filter out non-relevant documents without actually assessing them for relevance (i.e. without actually doing review). In that case, there is a strong danger that the keywords and concepts you are using to do the filtering are also removing a number of relevant documents. And because you’re not doing what the web search engines do–creating a secondary index that can be revisited at a later point in time–but instead are completely removing those ECA’d documents from all further search and review, you’re losing those relevant documents forever.

When a Slam Dunk is a Smoking Gun

For example, one might be tempted to use ECA tools to filter out all documents that contain the terms “football,” “touchdown,” “49ers,” “Lakers,” “slam dunk,” “foul shot,” etc. Clearly these are all sports references and (let’s presume) sports emails are not relevant to the matter at hand but rather part of background office chatter. However, suppose the collection contains an email that says, “Cindy, I was able to reverse engineer competitor X’s code. I think this should make our new product offering a total slam dunk!” Or there might be another email that says, “Hey, Jim, want to meet at the Tied House brew pub and catch the 49ers game after work on Monday? We can discuss our plans to fix the price of pork bellies.”

If the terms “49ers” and “slam dunk” have already been used during the ECA phase to completely remove every document that contains them, then these critical documents will be completely missed, putting the litigant at severe risk.

The solution, therefore, is to employ ECA in a manner that does not completely obliterate documents. Instead, ECA should be a tool for shifting certain sets of documents to a lower retrieval priority, a lower review priority or a secondary index. All of the documents should still be available. ECA simply helps with an intelligent prioritization of the searching and reviewing of those documents.

This approach allows the primary review to continue on as usual, with all the advantages of a pre-culled smaller number of documents. But if certain terms get discovered as part of that primary review process–terms such as “reverse engineer” or “pork bellies”–those terms can be used as queries into the secondary index. Then, the documents talking about meeting at the brew pub to watch the 49ers game and discuss the price fixing of pork bellies can still be recovered, despite having been pre-culled at an early stage. At the same time, if those ECA’d documents don’t contain “pork bellies,” they still remain in the secondary index and do not disrupt the efficiency and effectiveness of the primary index. It is the best of both worlds.

In short, the problem with ECA today is that it draws hard boundaries–it makes permanent decisions about documents when it really shouldn’t. The solution is to make those boundaries softer, to treat ECA as a prioritization tool, or as a mechanism for shifting documents into tiered secondary and even tertiary indexes. In that manner, poor decisions made early on in the process, under the blindness of an ECA process, are not made permanent. They can be easily, automatically and effectively corrected.
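
The tiered approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name TieredIndex, the naive whitespace tokenizer, the substring-based cull test and the sample emails are all hypothetical, and none of it represents any particular vendor's implementation. The point it shows is that a cull term demotes a document to a secondary index instead of deleting it.

```python
from collections import defaultdict


class TieredIndex:
    """Toy two-tier inverted index: culled documents are demoted, never deleted."""

    def __init__(self):
        # tier name -> term -> set of document ids
        self.tiers = {"primary": defaultdict(set), "secondary": defaultdict(set)}

    def add(self, doc_id, text, cull_terms):
        lowered = text.lower()
        # Demote (rather than discard) any document that contains a cull term.
        tier = "secondary" if any(term in lowered for term in cull_terms) else "primary"
        for token in lowered.split():
            self.tiers[tier][token.strip('.,!?"')].add(doc_id)
        return tier

    def search(self, query, tier="primary"):
        # A document matches only if it contains every term in the query.
        postings = [self.tiers[tier].get(tok, set()) for tok in query.lower().split()]
        return sorted(set.intersection(*postings)) if postings else []


# Hypothetical cull list and documents, loosely based on the examples in the text.
cull_terms = ["49ers", "slam dunk", "touchdown"]
index = TieredIndex()
index.add(1, "Lunch order for Friday is attached.", cull_terms)
index.add(2, "Want to catch the 49ers game? We can discuss our plans "
             "to fix the price of pork bellies.", cull_terms)
index.add(3, "Reminder: the pork bellies contract review is on Tuesday.", cull_terms)

primary_hits = index.search("pork bellies")
recovered = index.search("pork bellies", tier="secondary")
```

Here the primary review surfaces document 3, and once “pork bellies” is known to matter, the same query against the secondary tier recovers document 2, which a hard cull on “49ers” would have lost.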

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).

(Image: Smoke Signals by Frederic Remington.)

By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.

But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
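
As a rough illustration of how such a signal might be measured, the sketch below compares how often the “17th word is capitalized” signal fires in responsive versus non-responsive documents. The function names and the synthetic documents are assumptions made up for this example; a real system would feed the resulting rates into its ranking or feedback algorithm rather than just reporting them.

```python
def word_17_capitalized(text):
    """Signal: is the 17th word of the document capitalized?"""
    words = text.split()
    return len(words) >= 17 and words[16][:1].isupper()


def signal_rates(judged_docs, signal):
    """Return (rate among responsive docs, rate among non-responsive docs)."""
    counts = {True: [0, 0], False: [0, 0]}  # label -> [signal hits, total docs]
    for text, responsive in judged_docs:
        counts[responsive][0] += int(signal(text))
        counts[responsive][1] += 1

    def rate(hits, total):
        return hits / total if total else 0.0

    return rate(*counts[True]), rate(*counts[False])


# Synthetic judged documents: 16 filler words followed by the 17th word.
filler = " ".join(["word"] * 16)
judged = (
    [(filler + " Acme agreement", True)] * 30    # responsive, signal present
    + [(filler + " acme agreement", True)] * 5   # responsive, signal absent
    + [(filler + " lunch plans", False)] * 20    # non-responsive, signal absent
)

responsive_rate, non_responsive_rate = signal_rates(judged, word_17_capitalized)
```

With 30 of 35 responsive documents and none of the non-responsive ones showing the signal, the gap between the two rates is what would justify amplifying the signal.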

Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.

Search Q&A: The Six Blind Men and the E-Discovery Elephant

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of search algorithms out there. Why do you feel that collaboration is a better way to search?

DR. JEREMY PICKENS: Collaboration is a better way to search because e-discovery is not all about the algorithms. Algorithms also involve people.

In a previous post (Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?), I talked about why there will never be a magic bullet for e-discovery. That primarily has to do with the fact that an information need is typically never satisfied with just a single document, as it often is in web search. Rather, in e-discovery, hundreds and thousands of responsive documents must be found.

When there is that much information, it can be quite beneficial to have more than one person’s viewpoint. Every query is a different hypothesis about what is relevant, a different probe into the collection. More people working together means more viewpoints, which translate into a wider variety of probes.

An algorithm that is multi-searcher aware and tries to reconcile (look for both similarities among and gaps between) the various searcher activities is going to do a better job than an algorithm that only comes at the problem from one viewpoint.

Think of it with reference to that old story of the six blind men who wanted to know what an elephant looked like. The first man touched the elephant’s leg and declared, “The elephant is a pillar.” The second touched its tail and described it as like a rope. The third felt the trunk and said it was like a tree branch. The fourth felt the ear and thought it was like a big fan. The fifth touched the belly and asserted it was a thick wall. The sixth felt the tusk and contended the elephant was like a solid pipe.

Seeing that the blind men could not agree on what the elephant looked like, a passing wise man explained, “All of you are right. The reason every one of you is telling it differently is because each one of you touched a different part of the elephant. Actually, the elephant has all the features each of you found.”

In a sense, e-discovery search is like those blind men’s search of an elephant. Provided the searchers work collaboratively, then as each searcher touches and interprets a part, eventually the whole elephant emerges. In search, therein lies the benefit of collaboration.

Q&A: Collaborative Information Seeking: Smarter Search for E-Discovery

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: In our last Q&A post (Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?), you talked about machine learning and collaboration. More than a decade ago, collaborative filtering and recommendations became a distinguishing part of the online shopping experience. You’ve been interested in collaborative seeking. What is collaborative seeking and how does it compare to receiving a recommendation?

DR. JEREMY PICKENS: Search (seeking) and recommendation are really two edges of the same sword.  True, there are profound differences between search and recommendation, such as the difference between “pull” (search) and “push” (recommendation). But these differences are not what primarily distinguish collaborative information seeking from collaborative filtering. Rather, the key discriminator is the nature (size and goals) of the team that is doing the information seeking.

With collaborative filtering, the “team” is just one person. You, alone and individually, are looking for a new toaster oven, or a new musician to listen to, or a new restaurant at which to dine during your vacation in Cancun. If one of your friends already owns that toaster oven, or a copy of that CD, or has dined at that place in Cancun, you might get a better recommendation about which option to choose. But it is not the fact that the friend already owns or has already experienced something that satisfies your information need. Rather, you are relying on the already satisfied needs of others around you in order to get better information about what is available to you, and thereby satisfy your own need.

With collaborative search, on the other hand, you are a member of a team consisting of at least one other person, possibly more. You are actively working together with that person to satisfy a jointly held information need. My favorite example is of a couple looking to find a house or apartment. It does not help you to know that “people who bought this house also bought that house,” or that “people who live in this apartment also have lived in that apartment.” You are not going to move in together with all those people. You are going to move in with your partner.

And so as you are both searching for places to live, each of you enters different criteria about what is and is not important to you. You might like to live somewhere with great southern-facing exposure. Your partner might like a place with a garden. You might like a kitchen on the upper floor, and your partner might like enough work space in which to tinker on her motorcycle. A collaborative information seeking system should then attempt to find houses or apartments that satisfy both of your needs, jointly and simultaneously.

It is my belief that collaborative information seeking is much more appropriate to e-discovery than is collaborative filtering. Imagine collaborative filtering (“people who bought this also bought that”) in an e-discovery context: “People who have judged this document as responsive have also judged that document as responsive.” Of what value is it to know this? Given that someone else has already judged the document as responsive, why do I need to look at it? Unless I am doing quality control, it is simply a waste of time and client resources for the reviewer to judge again a document that has already been judged. Collaborative filtering falls apart in the e-discovery context, as it yields unnecessary repetition of labor. Collaborative filtering might work very well for toaster ovens, as you will still buy the toaster oven even if your friend has already bought the same model. It does not work well for e-discovery, as there is no sense in judging a document if your “friend” has already judged it.

By contrast, this is where collaborative search shines. Collaborative search allows you to find information that has not been viewed/judged/assessed by any member of your team of two or more people, but that is jointly relevant to the task that you are all working on, together. Collaborative search allows you and your team members jointly to push deeper into the collection, to documents that none of you would have likely found, were you working alone. Just as collaborative search allows you to find that house or apartment with both the southern exposure as well as the motorcycle workshop, it allows you to find documents that satisfy both the lead counsel’s as well as the review manager’s understanding of the task.
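
One simple way to picture this is sketched below, using the house-hunting example. It is only an assumption about how joint relevance could be scored, not a description of any actual collaborative search system: each searcher's criteria are a set of terms, a document's joint score is its weakest per-searcher score, and anything a teammate has already judged is skipped. All names and listings are hypothetical.

```python
def overlap(query_terms, doc_terms):
    """Fraction of one searcher's query terms that appear in the document."""
    return len(query_terms & doc_terms) / len(query_terms)


def collaborative_rank(docs, searcher_queries, already_judged):
    """Rank unjudged documents by their weakest per-searcher score, so a document
    ranks highly only if it serves every team member's criteria."""
    ranked = []
    for doc_id, text in docs.items():
        if doc_id in already_judged:
            continue  # no value in re-reviewing what a teammate has judged
        terms = set(text.lower().split())
        joint = min(overlap(q, terms) for q in searcher_queries.values())
        ranked.append((joint, doc_id))
    # Highest joint score first; ties broken alphabetically by document id.
    return [doc_id for joint, doc_id in sorted(ranked, key=lambda p: (-p[0], p[1]))]


listings = {
    "house_a": "southern exposure small yard",
    "house_b": "garden motorcycle workshop northern exposure",
    "house_c": "southern exposure garden motorcycle workshop",
    "house_d": "southern exposure garden workshop",
}
queries = {
    "you": {"southern", "exposure"},
    "partner": {"garden", "workshop"},
}
ranking = collaborative_rank(listings, queries, already_judged={"house_d"})
```

Taking the minimum is a deliberately strict design choice: a listing that is perfect for one person and useless to the other scores zero, which matches the idea of satisfying both needs jointly.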

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: There are a lot of tools in use from various vendors in e-discovery. At Catalyst, we’ve been using non-negative matrix factorization (see Using Text Mining Techniques to Help Bring Electronic Discovery Under Control) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the ASK, or Anomalous States of Knowledge (Belkin, 1980) theory of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.
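
One common way to make that selection is uncertainty sampling, sketched below. This is a swapped-in, textbook heuristic rather than the method of any specific e-discovery product, and the probabilities are invented for illustration; in practice they would come from whatever model has been trained on the labels so far.

```python
def pick_next_to_label(predicted_probs):
    """Uncertainty sampling: choose the unlabeled document whose predicted
    probability of being responsive is closest to 0.5."""
    return min(predicted_probs, key=lambda doc_id: abs(predicted_probs[doc_id] - 0.5))


# Hypothetical model output: probability that each unlabeled document is responsive.
probs = {"doc_1": 0.97, "doc_2": 0.52, "doc_3": 0.08, "doc_4": 0.30}
next_doc = pick_next_to_label(probs)
```

The reviewer is asked about doc_2, the document the model is least sure of, rather than doc_1 (almost certainly responsive) or doc_3 (almost certainly not). After the label comes back, the model is retrained and the selection repeats.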

Q&A: Is There a Google-like ‘Magic Bullet’ in E-Discovery Search?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Information retrieval is a discipline from the 1970s. Relational databases arrived in the 1960s. Most e-discovery platforms combine full text search (from information retrieval) and a relational database. What do you think is new and exciting in the world of e-discovery with tools that are 40 and 50 years old? Do you think there is a magic algorithm that will be used in e-discovery that will be as disruptive as Google PageRank was for broad Internet searching?

JEREMY PICKENS: There are a number of different angles from which one could approach this question. Recall from a previous blog post that one of the primary distinguishing factors between web search and e-discovery search is that the former is geared toward finding the one best answer, such as a factoid or a home page (precision-oriented), whereas the latter typically requires thousands if not millions of relevant (responsive) documents in order to satisfy an information need. This difference is not insignificant; it changes the entire nature of the search system being designed to meet that need.

Take PageRank, as per your example. It is important to understand that what makes PageRank work so well for web-oriented search has as much (if not more) to do with what the user is trying to accomplish as it does with the algorithm itself. Stop for a moment and read that sentence again. Web users typically want a single, best answer. And quickly. What is the best way to satisfy that information need? It is to give the web searcher a result that a lot of other people already think is pretty good, e.g. “votes” that come in the form of link data. If enough web pages link to a single web page and use topically relevant keywords in that link’s anchortext, that web page will be boosted in the rankings. That web page will be “voted” to the top.

More to the point: The specific algorithm that is used to count those votes is not as important as simply having the votes in the first place. Having the votes is what moves your page from page 57 of the results to page 1. A better algorithm might move the page to rank #2 on page 1, rather than rank #9 on page 1. But 90% of what got that document to page 1 was the votes themselves, rather than the mathematics of how the votes were counted. And simply being on page 1 accounts for 90% of the success of PageRank, as typical web searchers will only look at the first page of results and almost never further.

In summary, it is not so much the PageRank algorithm (mathematics) that makes PageRank so successful. It is the signal (link “votes”) used as input to the algorithm; the signal correlates well with the ultimate user goal.
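
To make the distinction between the signal and the mathematics concrete, the sketch below counts raw link “votes” on a tiny, made-up link graph and also runs a plain power-iteration version of PageRank on the same graph. The graph, the function names and the damping value of 0.85 are assumptions chosen for illustration, and the simplified PageRank assumes every page has at least one outgoing link.

```python
def inlink_votes(links):
    """Count raw 'votes': how many pages link to each page."""
    votes = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            votes[target] += 1
    return votes


def pagerank(links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; assumes every page has an out-link."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in links}
        for page, targets in links.items():
            # Each page splits its damped rank evenly among the pages it links to.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c", "a"]}

votes = inlink_votes(links)
ranks = pagerank(links)
top_by_votes = max(votes, key=votes.get)
top_by_rank = max(ranks, key=ranks.get)
```

On this toy graph both approaches put the same page first. That is only one small example, not evidence about the web at large, but it shows the sense in which the link data does most of the work.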

So the question is whether there will ever be a magic algorithm for e-discovery that will be as disruptive as PageRank. This is the same as asking whether there will ever be a single signal (such as a link “vote”) that correlates well with the user goal or intention. At the risk of making too bold of a claim, I think that the answer is no.

An e-discovery searcher’s information need simply does not fit the “magic bullet” profile. Someone engaged in e-discovery does not look at the first page of results and stop. That person (or a team of reviewers) may look at 20 pages. Or 100 pages. So whether one of the many available relevant documents is on page 1 or on page 57 matters much less. The user information need does not match what PageRank — or PageRank-like magic bullet algorithms — is trying to do.

Magic bullet algorithms try to get the absolute single best result (or a small handful of results) to the very top of the list. E-discovery users need thousands or millions of relevant results. And when there is that much information, it will take a huge diversity of signals, and coordination among dozens of algorithms, to find everything exhaustively.

Please note, however, that this does not mean algorithmic approaches will not work for e-discovery. Quite the contrary; e-discovery is in need of more, better and smarter algorithms. And these algorithms will improve our ability and capacity to meet the e-discovery challenge. It is just that the algorithms developed will not be “magic bullet” algorithms. They will be like a well-coordinated orchestra, with dozens of components playing together in unison.

(Image: Felipe Micaroni Lalli per Creative Commons.)

Search Q&A: How to Evaluate the Quality of an E-Discovery Search Platform

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Since e-discovery is already costly and time consuming, there doesn’t seem to be a good way for customers to compare offerings by running a case in different systems. Besides sales slicks, acronyms and generic testing such as TREC, how do you think customers should evaluate the quality of the platform they have chosen to handle e-discovery?

JEREMY PICKENS: This is a good question, and one to which there is no single, easy answer. That said, one possibility would be for the platform itself to give you internal metrics in the form of goal-oriented progress prediction. For example, if your goal is to find all responsive or privileged documents in a collection, a good platform should not only give you an estimate of how many more responsive documents it thinks are available to be found, but also let you track the history of that prediction before and after various events.

One should be able to get a sense of how right or wrong that prediction was, as one’s session-based information-seeking task progresses.

Specifically, if that estimate changes drastically after the execution of a new query, or after the responsiveness coding of a particular set of documents, that should be brought to the user’s attention. In a quality platform, it is less important that the platform get the prediction right at the very beginning than that it be forthcoming and transparent about its mistakes. This should allow the user to work in concert with the platform toward the goal.
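
A minimal sketch of that kind of transparency follows. Everything in it is an assumption made for illustration: the ProgressTracker name, the 50% alert threshold, and above all the naive estimator, which simply projects the responsive rate seen so far onto the unreviewed documents (reasonable only if the reviewed documents resemble a random sample, which they often do not).

```python
class ProgressTracker:
    """Tracks an estimate of remaining responsive documents across review events."""

    def __init__(self, collection_size, alert_ratio=0.5):
        self.collection_size = collection_size
        self.alert_ratio = alert_ratio  # relative change that counts as drastic
        self.reviewed = 0
        self.responsive = 0
        self.history = []  # (event label, estimated remaining responsive docs)

    def record(self, event, reviewed, responsive):
        """Log a coded batch; return True if the estimate shifted drastically."""
        self.reviewed += reviewed
        self.responsive += responsive
        prevalence = self.responsive / self.reviewed
        estimate = prevalence * (self.collection_size - self.reviewed)
        previous = self.history[-1][1] if self.history else None
        self.history.append((event, estimate))
        if not previous:
            return False  # nothing to compare against yet
        return abs(estimate - previous) / previous > self.alert_ratio


tracker = ProgressTracker(collection_size=10_000)
flags = [
    tracker.record("initial sample", reviewed=100, responsive=10),
    tracker.record("after new query", reviewed=100, responsive=60),
    tracker.record("next batch", reviewed=100, responsive=35),
]
```

The second event more than triples the estimate, so it is flagged for the user's attention; the third barely moves it. The history list keeps every prediction, which is what lets the user see how right or wrong earlier estimates were.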

Note: Jeremy Pickens, Bruce Kiefer and John Tredennick have written a research paper that expands on this topic, Process Evaluation in eDiscovery as Awareness of Alternatives. Pickens will present the paper June 6 at the ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings (DESI IV).

Search Q&A: How Search Engines Differ From Databases in Retrieving Information

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Basic search is often considered table stakes for e-discovery. Yet it seems few people understand that relational databases and search engines structure and retrieve information differently. Can you discuss the importance of rank and retrievability within search?

JEREMY PICKENS: Perhaps the biggest conceptual difference between relational databases and search engines is that the former operates on structured data, the latter on unstructured. With structured data, the types and relationships between various pieces of information are known. For example, with structured information, not only do I know that Product X has a weight, color and price, and not only do I know that price and weight are numeric while color is a character string, but I know that price is given in Euros and weight is given in kilograms.

Search engines, on the other hand, work with unstructured data, such as Word documents or email. The semantics and relationships are not known. For example, imagine the term “1055” in an email. Does that term represent the weight of an object, the number of objects or the serial number? Similarly, is “baker” somebody’s name or somebody’s profession?

Search engines have to deal with this uncertainty, and retrieve documents with the highest likelihood of meeting your information need. Relational databases instead return all records that match a semantically-specific query expression.
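
The contrast can be sketched as follows. The product records, field names, emails and the very crude term-count scoring are all hypothetical; real search engines use far more sophisticated ranking functions, but the shape of the difference is the same: the database query is an exact predicate over typed fields, while the search returns a ranked list of likely matches.

```python
# Structured data: field types and units are known, so a query is an exact predicate.
products = [
    {"name": "Product X", "weight_kg": 2.5, "color": "red", "price_eur": 40.0},
    {"name": "Product Y", "weight_kg": 7.0, "color": "blue", "price_eur": 15.0},
]
under_20_eur = [p["name"] for p in products if p["price_eur"] < 20.0]

# Unstructured data: nothing says what "1055" or "baker" means,
# so documents are scored and ranked instead of matched exactly.
emails = {
    "e1": "baker street shipment of 1055 units arrived",
    "e2": "ask the baker about the invoice",
    "e3": "meeting moved to friday",
}


def rank(query, docs):
    """Rank documents by total occurrences of the query terms; drop zero scores."""
    query_terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


ranked = rank("baker 1055", emails)
```

The database answer is a set with no ordering and no doubt about membership. The search answer is an ordering: e1 comes first because it matches both terms, e2 follows with a partial match, and the engine never learns whether “baker” was a name or a profession.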

Search Q&A: What E-Discovery Search Can Learn From Music Search

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: You’ve been involved with e-discovery for just over six months. You’ve attended LegalTech in New York in January, studied companies and talked to customers. How would you describe the state of information retrieval in e-discovery compared to your previous research in music and collaboration?

JEREMY PICKENS: Information retrieval is a decades-old field with many different types, techniques and algorithms supporting a wide variety of user information needs. Not all technologies are equally appropriate in every context; some are more appropriate to the web (navigational searches), some to music (recommendation engines), and some to e-discovery (recall-oriented informational searches).

E-discovery has borrowed and experimented with many classic information retrieval techniques, such as query suggestion via term co-occurrence analysis, query expansion via analysis of morphological variants (e.g. “stemming”), and clustering. I also find it encouraging that relevance feedback is making its way into e-discovery in the form of predictive and suggestive coding. The field’s core retrieval framework is Boolean, rather than more modern probabilistic and learning-to-rank approaches, which is unfortunate. Modern ranking algorithms are much more effective. The main impediment to moving beyond Boolean retrieval seems to be legal precedent rather than lack of technological prowess, so I am confident that this will continue to evolve.

There is one area, however, in which e-discovery could learn from work in music search and collaborative search. That area is the notion of information seeking as a multi-stage, session-based task. Most traditional information retrieval frameworks have been developed for ad hoc information needs, meaning that a query is issued and an answer given, at which point the interaction ends. The system treats subsequent queries as unrelated.

E-discovery is different, and is more in line with music search and collaborative search, in that the users’ information-seeking activity is ongoing and more than a single piece of information (e.g. document) is sought. Session-oriented thinking allows for a different approach to retrieval system design, and I expect to see an increase in awareness for these types of approaches.

Search Q&A: Introduction to a Series of Posts on the Science of Search

A good friend of mine, Miles Kehoe, co-author of the blog Enterprise Search, tells the story of his days at Verity when Google was emerging. Verity held a focus group to get human feedback from users about the quality of their search results. For some simple A-B testing, they also included feedback from customers using Google. It didn’t take long for Verity to realize that people really liked Google.

To confirm this observation, they changed the parameters of the test. They mimicked the minimalist style of Google’s result list and put the Google logo on top of the Verity results. Now, the focus group was divided in two: people looking at the Verity results with a Google logo and people looking at the Verity results with the Verity logo. The content was the same and the focus group still preferred the results from the page with the Google logo.

Brands are really powerful, as I learned again from the Verity-Google story. Google’s BackRub (as Larry Page and Sergey Brin first called their search engine) is a useful ranking system in the world of online content. With this algorithm, Google upset the status quo of the day. Companies like AlltheWeb, Lycos and AltaVista had different algorithms and quickly saw their market share eroding. Yahoo acquired AlltheWeb and finally shut it down on April 4, 2011.

The world of search on the web and the world of search in e-discovery are different animals. The plumbing of search shares many components, whether your goal is to find out about the royal wedding, get a recommendation about a digital camera, or find a key custodian in your collection. The best way to use search in e-discovery depends on your goal—reverted queries, inverted indexes, term frequency, entropy and other tools are available. Why does one algorithm work well in one use case and not work in a different use case?

In this series of posts, prodded by questions from me, Catalyst Senior Applied Research Scientist Jeremy Pickens will discuss some of the nuances within e-discovery and search.

Now on to our first Q&A ….