Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Q&A: How Can Various Methods of Machine Learning Be Used in E-Discovery?

[This is one in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: Vendors in e-discovery use a wide range of tools. At Catalyst, we’ve been using non-negative matrix factorization (see “Using Text Mining Techniques to Help Bring Electronic Discovery Under Control”) as a way to understand key concepts in a data collection. Can you describe the differences between supervised, unsupervised and collaborative approaches to machine learning? How could each be used in e-discovery?

JEREMY PICKENS: With reference to machine learning, the notion of supervision refers to having ground truth available. Ground truth means that you have data instances that are labeled in accordance with your goal, such as “responsive” and “non-responsive” or “privileged” and “non-privileged.” If this information is available for a small subset of one’s entire collection, it can be used to build (infer) a model. This model can then be used to label the rest of the (unseen) documents in the collection. Such labels can be accepted as is, or used as the basis for a smart prioritization for manual review.
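As a rough illustration of the supervised workflow described above, here is a minimal sketch in Python. It assumes a tiny hand-labeled seed set and uses a simple Naive Bayes model with add-one smoothing; the documents, the labels, and the function names `train` and `prob_responsive` are all hypothetical and chosen only for illustration, not a description of any particular product.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Build a simple multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word occurrences
    doc_counts = Counter()  # label -> number of labeled documents
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def prob_responsive(model, text, positive="responsive"):
    """Estimate the probability that an unseen document is responsive."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in the seed set
            # Add-one smoothing so unseen word/label pairs do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores to a normalized probability.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    return exp_scores[positive] / sum(exp_scores.values())

# Hypothetical seed set: the small labeled subset (the "ground truth").
seed_set = [
    ("merger pricing agreement draft", "responsive"),
    ("pricing terms for the merger", "responsive"),
    ("lunch menu friday", "non-responsive"),
    ("office party friday lunch", "non-responsive"),
]

# Hypothetical unseen documents from the rest of the collection.
unseen = {
    "doc_a": "revised merger pricing",
    "doc_b": "friday lunch plans",
    "doc_c": "agreement on party menu",
}

model = train(seed_set)
# Prioritize manual review: most likely responsive documents first.
ranked = sorted(unseen, key=lambda d: prob_responsive(model, unseen[d]), reverse=True)
print(ranked)
```

The ranked list can be read either way the answer describes: threshold the probabilities to accept labels as is, or simply review documents in that order.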

With unsupervised learning, on the other hand, no such labels are available. Instead, the goal is to analyze the collection and extract interesting statistical patterns and relationships. Who emailed whom and when? What are the primary or most frequently occurring topics? What topics are related to each other? Unsupervised learning teases out the answers to these questions, and the answers can then be used to guide an e-discovery searcher in the information seeking task. It can help the information seeker formulate the correct search queries.
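A small sketch may make the label-free nature of this concrete. The Python below assumes a toy set of emails (all names and text are made up) and uses plain counting rather than a full technique such as matrix factorization. It answers three of the questions above without any responsive or privileged labels: who emailed whom, which terms appear in the most messages, and which terms tend to appear together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical collection; no document carries a review label.
emails = [
    {"from": "alice", "to": "bob", "text": "pricing forecast for merger"},
    {"from": "alice", "to": "bob", "text": "merger pricing update"},
    {"from": "bob", "to": "carol", "text": "merger timeline"},
    {"from": "carol", "to": "alice", "text": "holiday schedule"},
]

STOPWORDS = {"for", "the", "a", "of"}  # illustrative stopword list

# Who emailed whom: count each (sender, recipient) pair.
pair_counts = Counter((e["from"], e["to"]) for e in emails)

term_counts = Counter()    # number of emails in which each term appears
cooccurrence = Counter()   # number of emails in which two terms appear together
for e in emails:
    terms = sorted(set(e["text"].lower().split()) - STOPWORDS)
    term_counts.update(terms)
    cooccurrence.update(combinations(terms, 2))

top_pair = pair_counts.most_common(1)[0]
top_term = term_counts.most_common(1)[0]
top_related = cooccurrence.most_common(1)[0]

print("Most active channel:", top_pair)
print("Most frequent term:", top_term)
print("Most related terms:", top_related)
```

A searcher who sees that "merger" and "pricing" cluster in one correspondent pair's traffic has a concrete starting point for a query, which is the guidance role described above.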

While it might seem that supervised learning is always preferred over unsupervised, the latter definitely has its advantages. For example, the Anomalous States of Knowledge (ASK) theory of information retrieval (Belkin, 1980) holds that users do not always know what they are looking for. (See “Who has sighted (not just cited) Belkin 1980, ‘ASK’?”) Their information needs are underspecified.

So instead of building a search system that brings back the best matches to a particular query, or that infers labels for every document in the collection based on a small seed set of labeled documents, it is sometimes better to help the user explore and understand what the collection is about. This exploratory phase, guided by the patterns extracted by an unsupervised learner, can then help e-discovery reviewers more clearly formulate the right questions to ask and come to a greater understanding of what they are trying to accomplish — for example, what it really means for something to be responsive or privileged.

By contrast, the collaborative approach is not so much a machine learning technique by itself. Rather, it is a strategy over machine learning techniques, and one that involves multiple searchers or reviewers explicitly working in concert. The advantage to collaboration is that, rather than deciding to work completely supervised or completely unsupervised, you can do both at the same time. Now, how one coordinates the various strategies matters to the final outcome. But simply acknowledging that different e-discovery team members can work on different parts of the problem takes us a long way toward a better solution.

In some ways, collaboration is complementary to the concept of active learning. Rather than a fully supervised approach (which operates on a static set of labels) or a fully unsupervised approach (which is better suited to exploration and sensemaking), active learning explicitly attempts to minimize manual (aka “expert”) label decisions by picking the most representative or most discriminatory data points to label.

Rather than just picking items to label at random and sticking with them, active learning is an iterative, interactive process that decides which data point should be labeled so as to best serve the overall goal of building a model for all the data points. Note that this is not (necessarily) the data point that has the highest (or lowest) probability of being responsive or privileged, but the data point that best helps build a robust, accurate model. In many ways, the goal of collaboration is similar.
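The iterative loop can be sketched in a few lines of Python. This is only one possible selection rule, uncertainty sampling, applied to a deliberately simple one-feature threshold model; the document pool, the `reviewer` stand-in for the human expert, and the function names are all assumptions made for illustration. At each step the model asks for a label on the document it is least sure about, not the one most likely to be responsive.

```python
# Hypothetical pool: document id -> number of keyword hits (a single feature).
pool = {"d1": 0, "d2": 1, "d3": 3, "d4": 4, "d5": 6, "d6": 7, "d7": 9, "d8": 12}

def reviewer(doc_id):
    """Stand-in for the human expert; here, 5 or more hits means responsive."""
    return pool[doc_id] >= 5

def fit_threshold(labels):
    """Model: midpoint between the highest non-responsive and lowest responsive score."""
    responsive = [pool[d] for d, y in labels.items() if y]
    non_responsive = [pool[d] for d, y in labels.items() if not y]
    return (max(non_responsive) + min(responsive)) / 2

def most_uncertain(labels, threshold):
    """Pick the unlabeled document closest to the current decision boundary."""
    unlabeled = [d for d in pool if d not in labels]
    return min(unlabeled, key=lambda d: abs(pool[d] - threshold))

# Start from two seed labels, one from each class.
labels = {"d1": reviewer("d1"), "d8": reviewer("d8")}
threshold = fit_threshold(labels)

queried = []
for _ in range(3):  # three rounds of expert labeling
    doc = most_uncertain(labels, threshold)
    labels[doc] = reviewer(doc)       # ask the expert
    queried.append(doc)
    threshold = fit_threshold(labels)  # rebuild the model with the new label

# Label the whole pool with the final model.
predictions = {d: pool[d] > threshold for d in pool}
print(queried, threshold)
```

In this toy pool, three targeted label requests are enough for the model to classify all eight documents the way the reviewer would, whereas labels picked at random might never land near the boundary.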

About Bruce Kiefer

Bruce Kiefer directs Catalyst's Research and Development Group, helping to develop the next generation of our technology, and is vice president of our Hosting Applications Division. He has worked in IT for many years, helping to build, deploy, manage, scale and repair networks and systems that solve problems.

Before joining Catalyst, he was vice president of operations for Viawest Internet Services. During Bruce's tenure at Viawest, he built many of the internal tools, grew the network to four states, and took over product management for Viawest's managed hosting offering.

In addition to his IT expertise, Bruce has a master's degree in business administration. He joined Catalyst in 2005, where he combines his knowledge of technology and business to help drive product development and build out operations.
