I recently wrote an article challenging the belief that subject matter experts (SMEs) are required for training the system in technology-assisted review. (See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?) This view, held by almost everyone involved with TAR, stems from the common-sense notion that consistency and, ultimately, correctness are critical to a successful ranking process. Indeed, Ralph Losey made that case eloquently in his blog post: “Less Is More: When it comes to predictive coding training, the ‘fewer reviewers the better’ – Part One.” He argues that one SME is the “gold standard” for the job.
Putting our science hats on, we tested this hypothesis using data from the 2010 TREC program. As I reported in the earlier article, we ran tests comparing the ranking effectiveness using seeds from the topic authorities (SMEs) and the reviewers. We found that the rankings based on the SMEs’ seeds and on the reviewers’ seeds were close: there was no clear winner, although one did better than the other at different points or on different projects.
Interestingly, we found that using the experts to QC 10% of the review team’s judgments proved as effective as using experts alone and sometimes better. We also showed that using experts for QC review was both quicker and cheaper than using experts alone for the training.
Pushing the Limits
For this post, we decided to push the limits of our earlier experiment to see what might happen if we pitted the experts against the review teams in yet another way. For this experiment, we sought out the specific training documents where the SMEs (topic authorities) expressly disagreed with the judgments by the review team. Specifically, we decided to identify documents where the experts voted one way (responsive, for example) and the reviewers the other way (non-responsive). Our plan was to use these documents, and these documents only, to train our system. We wanted to see which set of judgments would produce the better results: those from the experts or the conflicting judgments from the review team.
To be completely clear: In these experiments, the judgments from the review team were not just slightly wrong. The review team judgments were, from the perspective of the topic authority, 100% wrong. Every single document in these training sets that the topic authority marked as responsive, the review team marked as nonresponsive, and vice versa.
As in my earlier article, we again worked with four topics from the TREC legal track. For each topic, we sought out documents where the judgments of the experts and reviewers conflicted. We used those conflicting judgments (and only those judgments) to twice train the algorithm: once using the experts’ judgments and once using the review team’s judgments. We wanted to see which approach produced the better ranking.
Following the methodology of my earlier article, we also did a third run of the algorithm. For this one, we simulated using the experts for quality control. We assumed that the experts would review 10% of the review team’s judgments and correct them (because in this case we already knew that the experts disagreed with the review team). Thus, the third run contained a mix of the review team’s judgments (arguably mistakes) with a randomly selected 10% of them corrected to reflect the experts’ (arguably correct) judgments.
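For readers who like to see the setup spelled out, here is a minimal sketch of how the three sets of training labels could be assembled. It is an illustration only, not our production code: the function name, the dictionary layout and the fixed random seed are all assumptions made for the example.

```python
import random


def build_training_sets(expert_labels, qc_fraction=0.10, seed=0):
    """Build the three label sets compared in the experiment.

    expert_labels maps a document id to True (responsive) or False
    (non-responsive), and holds ONLY documents on which the review
    team disagreed with the expert. The names and structure here are
    hypothetical, chosen for illustration.
    """
    # Run 1: the experts' judgments, used as-is.
    expert_run = dict(expert_labels)

    # Run 2: the review team's judgments. Because every document is a
    # disagreement, each reviewer label is the opposite of the expert's.
    reviewer_run = {doc: not label for doc, label in expert_labels.items()}

    # Run 3: reviewer judgments with a random 10% corrected by the expert.
    rng = random.Random(seed)
    n_corrected = round(len(expert_labels) * qc_fraction)
    corrected_docs = rng.sample(sorted(expert_labels), n_corrected)
    qc_run = dict(reviewer_run)
    for doc in corrected_docs:
        qc_run[doc] = expert_labels[doc]

    return expert_run, reviewer_run, qc_run
```

Each of the three dictionaries would then be handed to the ranking engine as its only source of training labels.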
Because we are training using only documents on which the topic authorities and non-authorities disagree, the absolute level of performance on these topics is lower than if we had used every available training document, i.e. all those additional documents on which both the authorities and non-authorities agreed. That, however, is not the purpose of this experiment. Our goal was to isolate and assess relative differences between these two variables.
Sound like fun? Here is how it turned out.
Quick Primer on Yield Curves
As before, we present the results using a yield curve. A yield curve presents the results of a ranking process and is a handy way to illustrate the difference between two processes. The X axis shows the percentage of documents that are available for review. The Y axis shows the percentage of relevant documents (recall) found at each point in the review.
The higher the curve is and the closer it is to the top left corner, the better the ranking. The sharply rising curve signifies that the reviewer is presented with a higher percentage of relevant documents, which is the goal of the process.
The gray diagonal line shows the results of presenting relevant documents in random order, which is the expected outcome of linear review. On average, the reviewer can expect to see 10% of the relevant documents after reviewing 10% of the total, 50% after 50%, and so on until the review is complete. It provides a baseline for our analysis because review efficiency shouldn’t get any worse than this.
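If you want to draw a yield curve yourself, the calculation is simple. The sketch below assumes you already have a ranked list of documents, each flagged as relevant or not according to your ground truth. The function names are ours for this example and do not come from any particular product.

```python
def yield_curve(ranked_relevance):
    """Return (fraction_reviewed, recall) points for a ranked list.

    ranked_relevance is a list of booleans in ranked order, where True
    means the document is relevant according to the ground truth.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("no relevant documents, so recall is undefined")
    n_docs = len(ranked_relevance)
    points = [(0.0, 0.0)]
    found = 0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        found += is_relevant
        points.append((position / n_docs, found / total_relevant))
    return points


def depth_for_recall(ranked_relevance, target=0.80):
    """Smallest fraction of the collection reviewed to reach the target recall."""
    for fraction_reviewed, recall in yield_curve(ranked_relevance):
        if recall >= target:
            return fraction_reviewed
    return 1.0
```

Plotting the points with the fraction reviewed on the X axis and recall on the Y axis gives the curve. A random ordering would, on average, trace the diagonal.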
In this experiment, the ranking curve based on the expert judgments performed better than the one based on the review team’s judgments. The expert training hit 80% recall after reviewing about 10% of the total documents. The review team reached 80% recall at about 22% of the total population. When we based the training on the set of documents that included 10% QC correction from the SMEs, the review team would have reached 80% after reviewing about 15% of the total population.
One point is worth noting here. Even though the expert performed better than the review team, a ranking based on reviewer judgments still performed substantially better than manual linear review. Even relying solely on the review team’s training, you still only have to go about 22% of the way through the collection to get to 80% recall. With manual review, you’d have to go 80% of the way to get 80% recall.
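To put those percentages in terms of documents, here is a small worked calculation. The review depths are the approximate figures quoted above. The collection size of 100,000 documents is purely hypothetical, picked to make the arithmetic easy to follow.

```python
def documents_avoided(ranked_depth, recall_target, collection_size):
    """Documents saved versus linear review at the same recall target.

    With linear (random-order) review, reaching a recall of r requires
    reviewing, on average, a fraction r of the collection.
    """
    linear_depth = recall_target
    return round((linear_depth - ranked_depth) * collection_size)


COLLECTION_SIZE = 100_000  # hypothetical collection size, for illustration

# Approximate review depths needed to reach 80% recall, as quoted above.
depths_at_80 = {
    "expert training": 0.10,
    "review team with 10% expert QC": 0.15,
    "review team only": 0.22,
}

for name, depth in depths_at_80.items():
    saved = documents_avoided(depth, 0.80, COLLECTION_SIZE)
    print(f"{name}: about {saved:,} fewer documents than linear review")
```

On those assumptions, even the weakest of the three rankings spares the team roughly 58,000 of the 80,000 documents that linear review would require.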
In this case, the results are similar to Issue 1, albeit with different percentages of the total document population that needs to be reviewed. Using expert judgments only, you would reach 80% recall after reviewing about 48% of the total document population. Using reviewer judgments only, you would hit 80% at just under 57% of the review. Using an expert to perform QC of review team judgments, the reviewers would have hit 80% at about 53% of the review.
This case was particularly interesting. Using the same 80% recall threshold, the expert judgments and the (diametrically opposite) review team judgments brought the same results. You would only have to review about 30% of the document population regardless of whether you built the ranking on expert judgments or review team judgments.
Even more interesting, the experts’ ranking curve became worse after reaching 80% recall. The review team judgments produced a superior ranking as did the QC process.
Here again, the ranking based on expert judgments did substantially worse than the ranking based on review team judgments. Specifically, the expert did significantly worse in the early stages than the review team or the review team supplemented by 10% QC review by an expert.
Interestingly, the lines converge at about the 80% point and stay together after that. So for recall thresholds above 80%, you would have gotten the same results regardless of whether you relied on an expert’s judgment or that of your review team.
It is important to note that in all of these experiments, the expert judgment as to responsiveness or nonresponsiveness of a document was used as ground truth, i.e. to draw all yield curves. Thus, even for the rankings based on review team judgments, we evaluated the quality of those rankings based on the expert judgments.
What does this all mean?
From a scientific perspective, perhaps nothing at all. We took a relatively limited number of training seeds and based our ranking off them. In two examples, rankings based on the expert’s judgments outperformed the rankings based on review team judgments. In the other two examples, the review team judgments seemed to produce as good or sometimes better rankings than the experts.
It bears repeating that we used only documents on which the topic authorities and non-authorities disagreed. For that reason, the absolute level of performance on these topics is lower than if we had used every available training document, i.e. all those additional documents on which both the authorities and non-authorities agreed. That, however, was not the purpose of these experiments.
The point of these experiments was to see what happened when things go very badly during the training process. We intentionally removed from training all documents that the experts and review teams agreed on. Instead we simply focused on what might happen with opposite judgments on documents which likely required a judgment call on relevance, using our algorithms of course.
What we see from the experiments is that even if the review team was 100% wrong, it didn’t completely destroy the value of the ranking results. Even at their worst, review team rankings were still significantly better than linear review and often matched those based on expert judgments.
The point here is to suggest that you have options for training. Some will want to use the subject matter expert extensively for training. That approach works fine with us, our system and our algorithm.
But others may prefer a different approach. Our research suggests strongly that the expert can focus on finding useful exemplars for training, sample as much or as little as they like, and do QC during the course of the review. Meanwhile, the review team gets going right away and you can take advantage of continuous ranking. By its nature, continuous ranking will require that you use the judgments of the review team for training purposes.
It is also noteworthy in these experiments that, even when the review team’s judgments were not as effective as the expert’s, they nevertheless yielded good results. More to the point, even when they were 100 percent wrong, as they were in these experiments, they produced good results – really good results, relative to manual linear review.
This counters the suggestion of the SME-only advocates that the training documents have to be right or else the overall process is going to be completely wrong. That argument misses the point. The point isn’t that the process completely succeeds or completely fails. The point is that it is a matter of degrees. Even when the training is 100% wrong, the process does not completely fail. To the contrary, it performs well.
One final note to the research scientists and TAR geeks who are reading this. By now, you’ve no doubt realized that we’ve spilled a bit of our secret sauce in this post. By telling you that we used judgments that were 100% wrong, and still got results better than the random baseline, we have revealed hints about our proprietary TAR technologies here at Catalyst and how we conceptually approach our algorithms. Not every algorithm is going to be able to use 100% wrong judgments and still achieve better-than-random results. Even so, we believe that the value of openly sharing our results and encouraging this discussion outweighs the loss of any of our sauce.
We will continue to look for opportunities to compare the judgments of SMEs and review teams, and we hope that our initial experiments will contribute to this interesting discussion.