[This is another in a series of search Q&As between Bruce Kiefer, Catalyst's director of research and development, and Dr. Jeremy Pickens, Catalyst's senior applied research scientist.]
BRUCE KIEFER: What are “signals” and how can they improve search?
DR. JEREMY PICKENS: Signals are objectively measurable and quantifiable properties of a document or collection (or even user). Signals could come from the document itself (data) or from information surrounding the document, such as lists of users who have edited a document, viewed a document, etc. (metadata).
By itself, a signal does not necessarily make the search process better. Sure, there may be an instance when the user may want to inquire directly about whether, for example, the 17th word in a document is capitalized. The positional information (17th word) and the case information (capitalized or not) are both signals. But more often, signals are used to improve search algorithms through training, and to improve individual search processes through relevance feedback. Signals are the raw fuel on which those improvements power themselves.
On a basic level, something as simple as a name can be a signal. The name of a lawyer within a document is a signal that it may be privileged. The name of a product may be a signal that a document is confidential.
But signals can also be more abstract. Take the example of whether the 17th word in any particular document is capitalized. Generally, knowing this is probably not useful. But what if you knew that 30 of the past 35 documents that have been marked as responsive all contain a capitalized word at the 17th position and none of the non-responsive documents do? If you are able to identify that signal, then the signal can be amplified within the search algorithm itself so as to steer you towards additional documents with the same signal.
Signal selection, or determining which signals to measure and track, is an open problem. It is often domain dependent, if not matter dependent. There are some generally useful signals, such as word presence, word frequency, anchortext hyperlinks (in the case of web documents) or to/from “hyperlinks” (in the case of email). But determining what other signals to employ involves a mixture of intuition, mathematics, and experimentation. When it is done correctly, though, it yields huge gains in ranking algorithm effectiveness.
[...] Q&A: Learning to Read ‘Signals’ Within Document Collections – http://bit.ly/rkJHsO (Jeremy Pickens, Bruce [...]