
Information retrieval researchers at Shonan last week. That's me in the center, wearing the yellow T-shirt.
Last week, I was honored to join a small group of information-retrieval researchers from around the world, from both industry and academia, who gathered at the Shonan Village Center in Kanagawa, Japan, to discuss issues surrounding the evaluation of whole-session, interactive information retrieval. In this post, I introduce the purpose of this meeting. In later posts, I hope to further review the discussions that took place at Shonan and my own impressions.
Traditionally, information retrieval (a.k.a. search) has been viewed as a stateless, non-interactive process. The user issues an ad hoc query to a search engine and the engine responds with its best attempt at answering that query, with results ranked by their likelihood of satisfying the user’s information need.
Interactive information retrieval, on the other hand, presumes multiple rounds of user-system exchange. The interactions during this exchange are presumed to be non-independent. Each query has some sort of relationship to previous queries, if only because the overall series is in support of the same user task or goal.
Examples of scenarios in which interactive information retrieval is necessary include travel or event planning, education and learning, seeking entertainment, and (of course) e-discovery. When queries are independent, the best the system can do is answer each query as if it were the last that the user will ask. However, when queries are non-independent, both the user and the system have the chance to engage in deeper and wider patterns of exploration.
Evaluating Interactive Information Retrieval
Evaluation of one-shot queries has a long and rich history. Concepts such as “binary relevance” and “precision and recall,” combined with batch mode evaluation, have led to countless advances in the state of the art. These advances, from the 1960s to the 1990s, allowed search engines, especially in a web context, to improve to the point at which they now bring huge benefits to society. Evaluation of interactive information retrieval tasks, on the other hand, does not yet have universally accepted metrics. The very nature of the interactivity (non-independence of a sequence of user actions and system responses) both gives the scenario its power and makes it difficult to evaluate.
The power, again, comes from the breadth and depth of what is made possible; the evaluation difficulty comes from this very same interdependence. When a single query is performed, it can generally be expected that the user traverses the results list in linear order, from estimated best to estimated worst result, and that, with some probability, the user abandons the traversal at each step. These (generally realistic) assumptions allow the ad hoc, one-shot query to be evaluated in terms of the position of relevant documents within the list.
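To make that user model concrete, here is a minimal sketch of a position-discounted score in the style of rank-biased precision. It is one possible reading of "linear traversal with some chance of abandoning," not a metric prescribed at the meeting. The function name, the persistence value of 0.8, and the relevance labels are all assumptions chosen for illustration.

```python
def rank_biased_precision(relevance, persistence=0.8):
    """Score one ranked list under a linear-traversal user model.

    relevance: list of 0/1 judgments, best-ranked result first.
    persistence: assumed probability that the user continues to the
        next result instead of abandoning the list (illustrative value).
    """
    if not 0.0 <= persistence < 1.0:
        raise ValueError("persistence must be in [0, 1)")
    score = 0.0
    weight = 1.0 - persistence  # weight given to the first rank
    for rel in relevance:
        score += weight * rel
        weight *= persistence  # deeper ranks are less likely to be seen
    return score


# Hypothetical judgments: the same single relevant document,
# placed at rank 1 versus rank 4.
relevant_first = rank_biased_precision([1, 0, 0, 0])
relevant_last = rank_biased_precision([0, 0, 0, 1])
```

Under these assumptions the list with the relevant document at rank 1 scores higher than the list with it at rank 4, which is exactly the position sensitivity that one-shot evaluation relies on.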
However, when multiple queries are performed, an element of non-determinism enters into the picture. A user typically does not examine all results in the list from the first query, then all results in the list from the second query, and so on. Instead, one user might only examine 57 results from the first query, 9 results from the second query, and then 82 results from the third query. Another user might examine 3 results from the first query, 18 results from the second query, and 17 results from the third query.
Furthermore, the order in which the results are seen by the user affects the next round of interactivity. That is, the second and third queries that are issued by an information seeker are influenced by which documents were seen during the first round of interaction. Even if two users started with the same first query, the user who looked at 57 results might have a very different notion of how to formulate the next query than the user who looked at only 3 results.
How, then, should these two users’ experiences with the interactive search engine be evaluated? Should it be the product or sum of the quality of the individual ranked lists for each query? That ignores the depth to which the user actually traveled in each list over the course of the session. Should evaluation instead be a function of the sequence of documents that the user actually saw during the course of the session, no matter which individual results list a document came from? That is better, but it still ignores the effects of document examination order on the queries that were issued, and more importantly on the queries that could have been issued, had the user traversed to either a shallower or deeper position within a particular list. The non-deterministic range of possibilities poses a severe challenge to the evaluation of interactive information retrieval.
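The first two options can be contrasted in a few lines of code. This is only a sketch under stated assumptions: a session is represented as a list of (ranking, depth) pairs, quality is plain precision, the cutoff of 10 is arbitrary, and the relevance judgments and reading depths are invented for illustration.

```python
def precision(judgments):
    """Fraction of judged documents that are relevant (0.0 if empty)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def per_list_score(session, cutoff=10):
    """Sum precision@cutoff over each query's ranked list,
    ignoring how far the user actually read."""
    return sum(precision(ranking[:cutoff]) for ranking, _depth in session)


def seen_documents_score(session):
    """Precision over only the documents the user examined,
    pooled across every query in the session."""
    seen = []
    for ranking, depth in session:
        seen.extend(ranking[:depth])
    return precision(seen)


# Hypothetical binary relevance for the results of two queries.
ranking_a = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ranking_b = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# Two hypothetical users issue the same queries but read to different depths.
shallow_user = [(ranking_a, 2), (ranking_b, 3)]
deep_user = [(ranking_a, 10), (ranking_b, 8)]
```

With this made-up data, per_list_score cannot tell the two users apart, while seen_documents_score can. Neither function says anything about how the documents seen after the first query shaped the second query, which is the part that remains hard.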
Another issue in whole-session evaluation of interactive information seeking is whether to measure progress during a session or only upon its completion. Should the primary focus of evaluation be to estimate the quality of a session only at the end of the user’s sequence of interactions? Or is it more important to have a metric that measures, and therefore expects, progress throughout a session? Inherent in the answer to this question is whether one expects interactive information retrieval progress to be linear. Is it? Should it be? This remains an open question, one which we discussed at the Shonan Meeting.
Evaluation drives innovation. If you cannot measure something, you cannot improve it. The first step to improving interactive information retrieval systems is knowing what to measure and how to measure it. Only then will consistent improvements be possible.