Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Search Q&A: Learning to Read the ‘Signals’ Within Document Collections

[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]

BRUCE KIEFER: What are “signals” and how can they improve search?

DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even of a user). Signals can come from the document itself (data) or from information surrounding the document, such as the lists of users who have edited or viewed it (metadata).

By itself, a signal does not necessarily make the search process better. Sure, there may be instances when a user wants to ask directly whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel that powers those improvements.

On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.

But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
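To make the 17th-word example concrete, here is a minimal sketch of how a binary signal might be weighted from review decisions. It is not Catalyst's method: the smoothed log-odds ratio, the function names, and the count of 40 non-responsive documents are all assumptions chosen for illustration (the example above only gives the 30-of-35 figure).

```python
import math


def word_is_capitalized(text, position):
    """Return True if the word at the 1-based position starts with an uppercase letter."""
    words = text.split()
    if position < 1 or position > len(words):
        return False
    return words[position - 1][:1].isupper()


def signal_weight(responsive_hits, responsive_total, other_hits, other_total, smoothing=0.5):
    """Smoothed log-odds ratio of a binary signal (an assumed, illustrative measure).

    A positive value means the signal is more common among responsive
    documents; a value near zero means it does not separate the two groups.
    """
    p = (responsive_hits + smoothing) / (responsive_total + 2 * smoothing)
    q = (other_hits + smoothing) / (other_total + 2 * smoothing)
    return math.log(p / (1 - p)) - math.log(q / (1 - q))


# 30 of 35 responsive documents carry the signal; none of the
# (hypothetical) 40 non-responsive documents do.
weight = signal_weight(30, 35, 0, 40)
print(weight > 0)  # the signal favors responsive documents
```

A ranking function could then add this weight to the score of any unreviewed document for which word_is_capitalized(text, 17) is true, which is one simple way of "amplifying" the signal.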

Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.
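As a rough illustration of the generally useful signals mentioned above, the sketch below pulls word presence, word frequency, and to/from links out of a single email. The function name, the dictionary keys, and the simple tokenizer are assumptions for this example, not part of any particular search system.

```python
import re
from collections import Counter


def extract_signals(body, sender=None, recipients=()):
    """Collect a few basic signals from one email (illustrative only)."""
    # Lowercase and split into simple word tokens.
    words = re.findall(r"[a-z0-9']+", body.lower())
    frequency = Counter(words)
    return {
        # Which words occur at all.
        "word_presence": set(frequency),
        # How often each word occurs.
        "word_frequency": frequency,
        # Sender-to-recipient pairs, treated like "hyperlinks" between people.
        "links": {(sender, r) for r in recipients} if sender else set(),
    }


signals = extract_signals(
    "Pricing memo: pricing terms attached",
    sender="alice@example.com",
    recipients=["bob@example.com"],
)
print(signals["word_frequency"]["pricing"])
```

Which of these signals actually helps on a given matter is the open question; a sketch like this only makes them measurable so that they can be tested.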

About Jeremy Pickens

Jeremy Pickens, Ph.D., is one of the world's leading search scientists and a pioneer in the field of collaborative exploratory search, a form of search in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has six patents pending in the field of search and information retrieval, including two for collaborative exploratory search systems.

At Catalyst, Dr. Pickens researches and develops methods of using collaborative search to achieve more intelligent and precise results in e-discovery search and review. He also studies other ways to enhance search and review within the Catalyst system.

Dr. Pickens earned his master's and doctoral degrees at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King's College, London, on a joint grant with Goldsmiths, University of London. As part of the OMRAS project (Online Music Recognition and Searching), he helped organize the first Music Information Retrieval (ISMIR) conference in Plymouth, Mass. Before joining Catalyst, Dr. Pickens spent five years as a research scientist at FX Palo Alto Lab, where his major research themes included video search and collaborative exploratory search.

Dr. Pickens is co-author of the forthcoming book, A Taxonomy of Collaborative Information Seeking, to be published by Morgan & Claypool Publishers. He was an editor of the spring 2010 special issue on collaborative information seeking of the journal Information Processing and Management. He is a frequent author and speaker on the topic.
