Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices
John Tredennick

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.

Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both books were ABA best sellers on the use of computers in litigation. At the same time, he wrote How to Prepare for, Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote Lawyer's Guide to Spreadsheets (Glasser Publishing 2000) and Lawyer's Guide to Microsoft Excel 2007 (ABA Press 2009).

John is the former chair of the ABA's Law Practice Management Section. For many years, he was editor-in-chief of the ABA's Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on legal technology to audiences on four of the five continents.

Videos: The LegalTech Band Proves E-Discovery Can Rock

In the midst of LegalTech New York this year, Big Data and the Gigabytes took to the stage at The Three Monkeys bar to show that e-discovery professionals know how to rock a joint. Although some of the band members are competitors by day, they proved they could put out some pretty tight sound when they came together that night.

(To read more about the band and see a gallery of photos, see our earlier post.)

The band started off with a rocking rendition of Lynyrd Skynyrd’s “Sweet Home Alabama”:

Next up, the band got the audience dancing with a cover of the 1971 hit “Do You Know What I Mean” by Lee Michaels:

From there, the band played “Dead Flowers” from The Rolling Stones’ 1971 album Sticky Fingers:

Next on the set list came Delbert McClinton’s “Standing on Shaky Ground”:

The band closed with a 1978 song by The Rolling Stones, “Just My Imagination,” their cover of the song that was originally a hit for The Temptations:

The LegalTech Band Rocks New York City

On Wednesday, Jan. 30, in the midst of LegalTech New York, the place to be was The Three Monkeys bar near Broadway and 54th Street. At least it was the place to be that evening after 9 p.m. if you like good rock and roll.

The LegalTech band took the stage at the upstairs bar around 9:30 that evening to show that e-discovery professionals still have what it takes to rock the joint. It was the night for Big Data and the Gigabytes (formerly Predictive Chording).

[For videos of the band's performance, click here.]

The band consisted of music professionals from across the industry.

  • Lance Doss, director of corporate/legal technology, Access Staffing.
  • Marc Kronenberg, president, Forum Technology Group.
  • Sean McMechen, vice president of project management, DiscoverReady.
  • Tom Barnett, managing director and firm-wide e-discovery practice leader, Stroz Friedberg LLC.
  • Randy Burrows, vice president and general manager, Xerox Litigation Services.
  • Dean Kuhlmann, VP of business development, Lateral Data.
  • Lou Verrelli, principal, NextGen Reporting.
  • John Tredennick, president and CEO, Catalyst Repository Systems.

Ardent competitors in some cases, they teamed up for a night of fun and music.

Videos of their full set can be seen here. The set consisted of:

  • Lynyrd Skynyrd’s “Sweet Home Alabama”
  • Lee Michaels’ “Do You Know What I Mean”
  • “Dead Flowers” by The Rolling Stones
  • Delbert McClinton’s “Standing on Shaky Ground”
  • “Just My Imagination,” The Rolling Stones’ cover of The Temptations’ hit.

It was a great night for everyone who attended.

The rhythm section warms up.

The front line is rocking.

John Tredennick on drums.

Sean McMechen on lead.

A blur of rocking drumming.

Dean Kuhlmann on bass.

Lou Verrelli on sax.

John keeps the beat.

Sean takes it home.

My Key Word Searches are Better than Your Predictive Ranking Technology!

I recently got a distress call from an e-discovery partner of ours with an unhappy client. “It seems like there is something wrong with your predictive ranking technology,” our partner said on the Google Hangout. “It’s proposing that the client team review too many documents–more than we got with key word searching. Our client is upset. We need to do something to explain this fast.”

In this case the client team had not used technology assisted review (TAR) before; this was their first try at the process. They wanted proof that it was worth the extra cost for the technology. Specifically, they wanted to see whether it actually cut down on review costs, like everyone claimed.

The problem was that the system didn’t seem to work–at least in their eyes. They had started the review process by running a series of key word searches, which was their normal practice. The searches hit on a total of about 11,000 documents, out of a test set of 50,000 documents. This suggested they had only 11,000 documents to review for the production, about 20% of the total collected.

Our partner had recommended that the client try our Predictive Ranking technology as a better means to find responsive documents. Everyone’s initial expectation was that doing so would reduce the review population even further than the 11,000 documents that the key word searches hit. Somehow they got the impression that the review population might go down to more like 7,000. That would certainly justify the extra expense of this new technology.

Unfortunately, the opposite turned out to be the case. Instead of recommending that the team review 7,000 documents, or even 11,000 documents, our system suggested reviewing more than 18,500 documents.

You can imagine the client’s consternation. “You want us to pay you for these results? You just increased rather than reduced my review costs. I like it better the old way.”

It was time to go to work to figure out what happened. Fortunately, our team had analyzed the key word search results and compared them to the documents identified through Predictive Ranking. My job was to explain the difference between the two approaches. My hope was to show that the system worked well and provided a better outcome than key word search–at least if the goal was to identify and review potentially relevant documents.

To be sure, our Predictive Ranking system came up with more documents to review than did the key word searches. However, we quickly concluded that the key word searches–while finding many potentially responsive documents–missed a lot of others that should be considered as well. Here is how we reached that conclusion.

The Numbers

Let me start by giving you some basic information about the two processes. From there it becomes a bit easier to explain the difference in the results. You can then see how we got to our ultimate conclusion.

The client collected just over 51,000 documents for this production. As a first step, counsel created a set of key word searches and asked our partner to run them using our PowerSearch utility. As I mentioned earlier, the searches hit on about 11,000 documents.

We then started the Predictive Ranking process, which is the name we started using years ago for our TAR methods. The first step was to take an initial random sample for reference purposes and to estimate the overall richness of the population. We came up with an estimate of 22%, which would suggest that there were about 11,000 relevant documents in the collection.

Hmmm. That was pretty close to the number of documents found through the key word searches. Did they nail it this time?

We next worked with the client’s legal experts to review and tag seed documents and then to undertake several rounds of system training. At the end of the process, the system suggested a review cutoff at 36%. That meant that the review team would look at the top 36% of the ranked documents and ignore (after a confirmatory sample) the remaining 64%.

The resulting numbers looked like this:

  • Likely responsive and need review: 18,552.
  • Likely non-responsive and don’t need review: 32,982.

Our sampling also suggested that the top documents above the cutoff had a richness of about 50% (which meant half of these were likely responsive). The documents below the cutoff had a richness of about 7% (which meant that only 7 out of 100 were likely responsive). Seven percent seemed like a good number for the discard pile, one that most courts would accept.
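The arithmetic behind these figures can be checked with a short sketch. The document counts and richness percentages are the ones reported above; the variable names are hypothetical, and the projected responsive counts are rough estimates derived from sample richness, not actual review results.

```python
# Figures reported above; the variable names are illustrative only.
above_cutoff = 18_552   # likely responsive, proposed for review
below_cutoff = 32_982   # likely non-responsive, proposed discard pile
total = above_cutoff + below_cutoff

richness_above = 0.50   # sampled richness above the cutoff
richness_below = 0.07   # sampled richness below the cutoff

# Share of the collection that falls above the review cutoff.
cutoff_share = above_cutoff / total

# Rough projections of responsive documents on each side of the cutoff.
est_responsive_above = above_cutoff * richness_above
est_responsive_below = below_cutoff * richness_below

print(f"Total documents: {total:,}")
print(f"Share above cutoff: {cutoff_share:.0%}")
print(f"Estimated responsive above cutoff: {est_responsive_above:,.0f}")
print(f"Estimated responsive below cutoff: {est_responsive_below:,.0f}")
```

Projections like these are only as good as the samples behind them, which is why a confirmatory sample of the discard pile still matters.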

The Question

As mentioned earlier, our results caused heartburn for our partner and its client. The key word search approach seemed to require that the team review only 11,318 documents. Why pay for Predictive Ranking if it requires that the team review 7,000 additional documents? That’s about 60% higher than the key word results.

Understanding the Numbers

The answer requires that we better understand what all these figures mean. Unless we can compare apples to apples, we have no way to judge the efficacy of the two approaches. Fortunately we had an easy way to do just that.

We compared the document IDs for the files returned from the key word searches with the files returned from our Predictive Ranking process. What we found was pretty interesting. Let me show you with this simple diagram:

I created this diagram to map the comparative results of our Predictive Ranking and the key word searches. The circle represents the total documents in the population. If you add up the numbers in the four quadrants, they come to the 51,534 files at issue.

The four quadrants represent the different states of the document population.

  • The top left quadrant represents documents that our Predictive Ranking system found likely responsive but did not return from the key word searches. There were 11,285 documents in this category.
  • The top right quadrant represents documents that hit under both approaches. These documents were returned by the key word searches and were also designated by our Predictive Ranking system as potentially responsive. There were 7,267 documents in this category.
  • The bottom left quadrant represents documents that did not hit under either approach. Neither our Predictive Ranking system nor the key word searches deemed them likely responsive. There were 28,931 documents in this category.
  • The bottom right quadrant represents documents that were returned from the key word searches but were not deemed likely responsive by our Predictive Ranking system. There were 4,051 documents in this category.
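A comparison like this one comes down to set operations on document IDs. The sketch below is a minimal illustration, not the tooling we used: the function name, the dictionary keys and the ten-document toy population are all made up for the example. Only the four reported counts at the end come from the quadrants above.

```python
def compare_results(all_ids, keyword_hits, ranked_hits):
    """Split a document population into four groups by which method flagged each ID."""
    all_ids = set(all_ids)
    keyword_hits = set(keyword_hits)
    ranked_hits = set(ranked_hits)
    return {
        "ranking_only": ranked_hits - keyword_hits,       # top left quadrant
        "both": ranked_hits & keyword_hits,               # top right quadrant
        "neither": all_ids - ranked_hits - keyword_hits,  # bottom left quadrant
        "keyword_only": keyword_hits - ranked_hits,       # bottom right quadrant
    }

# Toy population of ten document IDs (invented for illustration).
population = range(1, 11)
keyword = {1, 2, 3, 4}
ranked = {3, 4, 5, 6, 7}

quadrants = compare_results(population, keyword, ranked)
counts = {name: len(ids) for name, ids in quadrants.items()}
print(counts)

# The four counts reported above should add up to the full population.
reported = {"ranking_only": 11_285, "both": 7_267, "neither": 28_931, "keyword_only": 4_051}
print(sum(reported.values()))
```

Because the four groups never overlap, their sizes always sum to the population, which gives a quick sanity check on any comparison of this kind.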

So, what can we say about all this? First, we can say that the team should probably review the 7,267 documents found in the top right quadrant. Both approaches tagged them as likely responsive. That does not mean that they will all be responsive but it is a good bet that a lot of them are.

Second, we can suggest that the 28,931 documents in the bottom left quadrant include few responsive documents. Neither the key word searches nor our Predictive Ranking system hit on these documents. There is still a need for confirmatory sampling but we can be pretty sure that there are not a lot of responsive documents hiding in this quadrant.

The Two Key Quadrants

That leaves us with two quadrants to consider and this is where we find the answer to our puzzle. Together, these two quadrants represent about 15,000 documents. Here is what we can say about each:

  • The top left quadrant represents 11,285 documents that our Predictive Ranking system found as likely responsive. The keyword searches provide no information about these documents other than to say that they did not return from the searches.
  • The bottom right quadrant represents 4,051 documents that hit on counsel’s keyword searches but our Predictive Ranking system found to be likely non-responsive.

If counsel only reviewed documents that returned from the key word searches, they would be ignoring the 11,285 documents identified in the top left quadrant. Many of them had already been tagged during the Predictive Ranking training session and thus we knew that there were responsive documents in this quadrant. Our richness estimate went so far as to suggest that 50% of them were likely responsive, which meant that counsel might be missing 5,000-6,000 responsive documents using their key word approach. It quickly became evident that counsel would have to at least test additional documents in this quadrant before dismissing them as not responsive.

Conversely, our Predictive Ranking system led us to question how many of the 4,051 documents in the lower right quadrant were responsive. In fact, we knew from training that many of the documents in that quadrant were not responsive. At the least, that is what the reviewers concluded when they addressed them during the sampling.

We suggested that the client test the documents in this quadrant before engaging in review. Our suspicion was that these were false hits from the key word searches and not likely of interest. Our estimate was that they would find a richness of about 7%.

Answering the Question

By now you have already figured out how to respond to the client’s concerns. Simply put, the key word searches–while effective at finding some of the potentially responsive documents–missed a lot of others that should be reviewed. The Predictive Ranking system found many of the documents returned from the key word searches but it also found a lot of other potentially responsive documents. The total numbers were higher but there was good reason for that outcome. There were more documents that needed to be reviewed.

Put another way, search has two qualitative measures: precision and recall. Precision is a measure of the number of true hits (actually responsive documents) returned from your search compared to the total number returned. Recall is a measure of the total true hits returned from your search against the actual number of true hits in the population.

In our case, the key word searches may have been good on precision (assuming that the documents in the top right quadrant were, in fact, responsive). However, they seemed to miss the boat on recall. The searches missed a lot of the other responsive documents. That is not a good thing if your opponent chooses to challenge your production in court.
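The two measures defined above reduce to simple ratios. Here is a minimal sketch with made-up numbers (a hypothetical search that returns 100 documents, 80 of them responsive, from a population holding 400 responsive documents). The figures are for illustration only and are not from this matter.

```python
def precision(true_hits_returned, total_returned):
    """Share of the returned documents that are actually responsive."""
    return true_hits_returned / total_returned

def recall(true_hits_returned, true_hits_in_population):
    """Share of all responsive documents in the population that the search returned."""
    return true_hits_returned / true_hits_in_population

# Hypothetical search: 100 documents returned, 80 of them responsive,
# while the whole population contains 400 responsive documents.
p = precision(80, 100)
r = recall(80, 400)
print(f"precision = {p:.0%}, recall = {r:.0%}")
```

A search can score well on the first ratio and poorly on the second, which is the pattern the key word searches appeared to show here.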

It turned out that my explanation proved helpful to our partner and the client team. They moved forward with their review using the documents ranked by our Predictive Ranking system. It turned out that there were a lot of responsive documents missed by the key word searches and many of the documents returned by the searches in the lower right quadrant were false hits. The explanation and diagram helped to clear up the mystery. I thought it might be helpful to others as well as they grapple with the mysteries of technology assisted review.

There may also be a moral to this story, so to speak. Discussion of technology assisted review often focuses on its ability to reduce document populations. But review is not just a numbers game—it’s also about getting it right. It does neither lawyers nor their clients any good to cut document populations if they are cutting a large number of potentially responsive documents in the process. As my story above illustrates, fewer is not always better. Here, Predictive Ranking proved itself superior to key word searching at getting it right. That may have saved counsel some grief further down the road.

Judge Peck Provides a Primer on Computer-Assisted Review

Magistrate Judge Andrew J. Peck issued a landmark decision in Monique Da Silva Moore v. MSL Group, filed on Feb. 24, 2012. This much-blogged-about decision made headlines as being the first judicial opinion to approve the process of “predictive coding,” which is one of the many terms people use to describe computer-assisted coding.

Well, Judge Peck did just that. As he hinted during his presentations at LegalTech, this was the first time a court had the opportunity to consider the propriety of computer-assisted coding. Without hesitation, Judge Peck ushered us into the next generation of e-discovery review—people assisted by a friendly robot. That set the e-discovery blogosphere buzzing, as Bob Ambrogi pointed out in an earlier post.

I recommend reading the decision (and its accompanying predictive-coding protocol) not for its result but for its reasoning. This is one of the best sources I have seen on the reasons for and processes underlying predictive coding. Indeed, Judge Peck provided a primer on how to conduct predictive coding that is must reading for anyone wanting to get up to speed on this process.

What is Computer-Assisted Review?

Judge Peck started by quoting from his earlier article in Law Technology News:

By computer-assisted coding, I mean tools (different vendors use different names) that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with (i.e. training by) a human reviewer.

As Judge Peck concluded: “This judicial opinion now recognizes that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.”

Why Do We Need Computer-Assisted Review?

The answer for Judge Peck was simple: Other methods of finding relevant documents are expensive and less effective. As he explained:

  • The objective of e-discovery is to identify as many relevant documents as possible while reviewing as few non-relevant documents as possible.
  • Linear review is often too expensive. Although it is seen as the “gold standard,” studies show that the computerized searches underlying predictive coding are at least as accurate as human review, if not more accurate.
  • Studies also show a high rate of disagreement among human reviewers as to whether a document is relevant. In most cases, the difference is attributable to human error or fatigue.
  • Key word searches to reduce data sets also miss a large percentage of relevant documents. The typical practice of opposing parties choosing keywords resembles a game of “Go Fish,” as Ralph Losey once pointed out.
  • Key word searches are often over-inclusive, finding large numbers of irrelevant documents that increase review costs. They can also be under-inclusive, missing relevant documents. In one key study the recall rate was just 20%.

Ultimately, Judge Peck reminded us of the goals underlying the Federal Rules of Civil Procedure. Perfection is not required. The goal is the “just, speedy, and inexpensive determination” of lawsuits.

Judge Peck concluded that the use of predictive coding was appropriate in this case for the following reasons:

  1. The parties’ agreement.
  2. The vast amount of ESI (over 3 million documents).
  3. The superiority of computer-assisted review over manual review or keyword searches.
  4. The need for cost effectiveness and proportionality.
  5. The transparent process proposed by the parties.

The last point was perhaps the most important factor leading to the decision: “MSL’s transparency in its proposed ESI search protocol made it easier for the Court to approve the use of predictive coding.”

How Does the Process Work?

The court attached the parties’ proposed protocol to the opinion. While it does not represent the only way to do computer-assisted review, it provides a helpful look into how the process works.

  1. The process in this case began with attorneys developing an understanding of the files and identifying a small number that will function as an initial seed set representative of the categories to be reviewed and coded. There are a number of ways to develop the seed set, including the use of search tools and other filters, interviews, key custodian review, etc. You can see more on this subject below.
  2. Opposing counsel should be advised of the hit counts and keyword searches used to develop the seed set and invited to submit their own keywords. They should also be provided with the resulting seed documents and allowed to review and comment on the coding done on the seed documents.
  3. The seed sets are then used to begin the predictive coding process. Each seed set (one per issue being reviewed) is used to begin training the software.
  4. The software uses each seed set to identify and prioritize all similar documents over the complete corpus under review. The attorneys then review at least 500 of the computer-selected documents to confirm that the computer is properly categorizing the documents. This is a calibration process.
  5. Transparency requires that opposing counsel be given a chance to review all non-privileged documents used in the calibration process. If the parties disagree on tagging, they meet and confer to resolve the dispute.
  6. At the conclusion of the training process, the system then identifies relevant documents from the larger set. These documents are reviewed manually for production. In this case, the producing party reserved the right to seek relief should too many documents be identified.
  7. Accuracy during the process should be tested and quality controlled by both judgmental and statistical sampling.
  8. Statistical sampling involves a small set of documents randomly selected from the total files to be tested. That allows the parties to project error rates from the sample.
  9. Here, the parties agreed on a series of issues that will, of necessity, vary on other cases. The key point is that the parties agree on the issues and test the coding during the process.

Random Samples

It is important to create an initial random sample from the entire document set. The parties used a 95% confidence level with an error margin of 2%. They determined that the sample size should be 2,399 documents. You can figure this out using one of the publicly available sample-size calculators such as Raosoft, which we often use.
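For readers who want to see where a number like 2,399 comes from, here is a minimal sketch of the standard sample-size formula for a proportion with a finite population correction. Several inputs are assumptions: a population of exactly 3,000,000 (the opinion says only "over 3 million"), z = 1.96 for 95% confidence, and the worst-case response distribution of 50%. Whether any particular online calculator uses exactly this formula is also an assumption.

```python
def sample_size(population, margin, z=1.96, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    # Adjust downward for the finite size of the collection.
    return round(n0 / (1 + (n0 - 1) / population))

# Assumed inputs: 3,000,000 documents, 2% margin of error, 95% confidence.
n = sample_size(3_000_000, 0.02)
print(n)
```

With a population this large the correction barely moves the answer, so the sample size depends almost entirely on the confidence level and margin of error rather than on the size of the collection.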

Seed Sets

The protocol goes on to describe a number of ways to generate seed sets including:

  • Agreed-upon search terms.
  • Judgmental analysis.
  • Concept search.

The parties frequently sampled the results from searches to evaluate their effectiveness.

There is at least a good blog post to be written about seed sets. Some computer-assisted coding systems like the one used for this case start their process with seed sets. The notion is that attorneys understand the cases, know what is and is not relevant and can train the system to recognize more relevant documents more effectively than starting with no seed documents.

Others think this is a mistake. They believe that however well meaning, the attorneys will bias the system to find what they think is relevant and get self-reinforcing results. In this regard, they are suggesting that the attorneys will make the same mistakes found in key word searches—thinking that you know which words will be most effective at finding your documents.

Systems following this logic urge the user to start from scratch, telling the system what is and is not relevant based on reviewing documents. As you do that, the system begins developing its own profile of relevant documents and builds out the searches. The belief is that the system may create a better search through this process than it might if you bias it with your seed documents.

There is a middle ground here as well. Many of the latter systems (no seed) will allow you to submit a limited number of seed documents as part of the training process. That may represent the best of both worlds or it may not, depending on your beliefs. The important point is that there are different approaches to computer-assisted processing. This protocol shows you one approach only.

Training Iterations

The process involves a number of computer runs to find responsive documents. The parties started with a first set of potentially relevant documents based on analysis of the seed set. After that review, the computer was asked to consider the new tagging and find a second set for testing. Then a third and a fourth.

The protocol suggested that the parties run through this process seven times. The key is to watch the change in the number of relevant documents predicted by the system after each round of testing. Once that change dropped below 5%, the parties had the option to stop. The notion is that the system has become stable by that time, with further review unlikely to uncover many more relevant documents.
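The stopping rule can be sketched in a few lines. This is one reading of it, not the protocol's own definition: it assumes the "delta" is the relative change in the predicted relevant count from one round to the next. The function name and the round-by-round counts are invented for illustration.

```python
def first_stable_round(predicted_counts, threshold=0.05):
    """Return the 1-based round where the relative change from the prior
    round first falls below the threshold, or None if it never does."""
    for i in range(1, len(predicted_counts)):
        previous, current = predicted_counts[i - 1], predicted_counts[i]
        if previous == 0:
            continue  # relative change is undefined against a zero baseline
        change = abs(current - previous) / previous
        if change < threshold:
            return i + 1
    return None

# Invented counts of predicted relevant documents after each of seven rounds.
counts = [4000, 5200, 6100, 6500, 6700, 6800, 6850]
stable_at = first_stable_round(counts)
print(stable_at)
```

In this made-up series the change between rounds four and five is about 3%, so the rule would allow the parties to stop after the fifth round.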

Finishing the Process

Once the training has completed and the system is “stable,” we move from computer-assisted to human-powered review. At that point, the producing party reviews all of the potentially responsive documents and produces accordingly.

Final QC Protocol

As a final stage, the parties need to focus on the potentially non-responsive documents—the ones the system says to ignore. The parties select a random sample (2,399 documents again) to see how many were, in fact, responsive.

These same documents (non-privileged ones) must be produced to the opposing party for review. If that party finds too many responsive documents in the sample or otherwise objects, it is time for a meet-and-confer to resolve the dispute. Failing that, you can always go to the court and fight it out.

Is This the Bible on Predictive Coding?

Certainly not. There are a lot of ways to approach this process. However, first opinions on any topic carry a lot of weight. We chose a profession that is guided by precedent, and these are first tracks on this new and exciting subject. The suggested procedures make sense to me and provide a starting point for your predictive coding efforts. This opinion and its accompanying protocol are important reading whether you are proposing or opposing the process for your next case.

 

Check for Privilege Before Turning Over Your Database: The Lesson in Thorncreek Apartments

Before you give opposing counsel the keys to your production database, run at least one check on the privilege field to see if any of your documents are marked “privileged.” That is the lesson a federal judge taught a hapless defense counsel in Thorncreek Apartments III v. Village of Park Forest, 2011 U.S. Dist. Lexis 88281 (N.D. Ill. August 9, 2011). If you don’t, you may be deemed to waive the privilege. I hate when that happens!

“What’s going on here?” you might ask. Can anyone be that sloppy? “Maybe,” I say in response. At least that’s what it seemed like here. Counsel literally made a production database available for more than seven months without once checking to see if it included privileged documents. A waiver is not inadvertent if you were hopelessly sloppy about it. Here is the story.

The Facts

The plaintiffs filed a motion before the District Court arguing that privilege was waived for six documents included in a production database provided by defendant Village of Park Forest. The Village argued that the documents were inadvertently produced in a production database hosted by its online vendor Kroll Ontrack. (I don’t think Kroll did anything wrong here.)

Defense counsel went through what seemed like a reasonable process in pulling files off of a number of tape backups. First, they conducted a key word search to pull back documents that might be responsive. The key words had been agreed upon by opposing counsel and in some cases ordered by the court. (Looks like there may have been some controversy around the key words and no, people weren’t talking about predictive coding in this case.)

As a second step, Kroll put the potentially responsive documents in an online database for defense counsel to review. They did so, marking documents as responsive, non-responsive and privileged.

The third step was to place the documents released by defense counsel in another online database made accessible to plaintiffs’ counsel. As the court noted, documents that the Village elected to withhold from production were not placed in this database. So far it all makes sense.

Producing the Privileged Documents

Here is where the rub came. At some point, plaintiffs complained that they were not able to see the documents returned from the agreed-upon searches but marked non-responsive. The Village had previously said it would include non-responsive documents in the production database in order to show how many its review had identified. In an attempt to be magnanimous, the Village elected to begin placing all of the documents in the production database—responsive and non-responsive.

As the court noted, the parties’ briefs left it a bit “murky” as to how the Village intended to handle the documents counsel had reviewed and marked “privileged.” Why they were not pulled out of the population before the production database was marked live is simply beyond me.

But they were not and 159 privileged documents went online, easily available to plaintiffs’ counsel. Even more surprising, during the seven months the production database was live, Village counsel did not bother to produce a privilege log. At one point, counsel claimed that there were no privileged documents to withhold. Somebody tell me how that happened.

So, as you probably guessed by now, depositions started and some wiseacre slapped two of the juicy privileged documents in front of a witness. Village counsel erupted, claiming privilege and inadvertent production. The game was on.

Game On: Reel Those Privileged Documents Back In

Actually, the game was over, at least for defense counsel. It appears that the parties came to agreement with respect to most of the privileged documents (probably the non-important ones) but disagreed with respect to six of them. Quickly concluding that at least some of the six were privileged, the court thus was required to review the doctrine of inadvertent waiver.

The test for inadvertent waiver is pretty simple, made more so by the recent amendments to F.R.E. 502:

  1. Were the documents privileged?
  2. Was the production inadvertent?
  3. Should privilege be waived nonetheless?

The privilege discussion didn’t interest me much. I did all that in law school. So, my attention was focused on the second and third prongs of the test.

Was the Disclosure Inadvertent?

As the court noted, this issue could be wrapped up with the third element, which goes to the heart of forgiveness. Rather than do that, the court espoused a simple analysis for this element.

It simply presumed based on the evidence before it that counsel didn’t really mean to include all of those privileged documents in the production database. And who would?

Certainly counsel wasn’t using those documents in an affirmative way, which was the original reason courts held that the full privilege would be waived. The old, “I did it because counsel told me to do it,” was a key way to waive privilege for that entire subject matter. That didn’t happen here.

Nor did the Village sit silent as opposing counsel was cramming those two juicy privileged documents down their witness’s throat. They got up and objected, allowing the deposition to go forward only under protest.

Surprisingly, and this seemed important to the court, it then took more than four months before Village counsel came up with a proper privilege log to join the dispute. While the court did not state this as a dispositive fact, you can just tell it didn’t like counsel’s lackadaisical approach. Neither would I; I wonder what happened there.

Anyway, the court let them off the hook and ruled that the production of privileged documents was inadvertent. On to the next step in the test. (But don’t you take four months to produce your privilege log.)

Reasonable Steps to Prevent Disclosure?

Once the court concluded the production was inadvertent, the Village had two more hurdles to cross:

  1. Did the holder of the privilege take reasonable steps to prevent production of privileged documents?
  2. Did the holder of the privilege promptly attempt to rectify the error after it became known?

This case was about the first prong of that test. The court complained that the Village provided “precious little” about the steps taken to find and isolate privileged documents. Counsel for the Village wrote an email stating that he spent “countless hours” reviewing a “relatively large amount” of documents to find those with privileged content. Why, the court asked, was there no affidavit to support this allegation? Why indeed.

The court had no sympathy for the next argument—counsel thought that marking the document “privileged” would keep it out of the Kroll database. As the court astutely pointed out:

It would have been a simple matter for the Village to check the production database created by Kroll—before it went live online and became available to [plaintiffs]—to verify that privileged documents were not disclosed.

Duh! I am not privy to the Kroll software but I bet there was a simple way to search to see if anything was privileged—either by privilege tag, if that was included in the production database, or by the reference number given to the privileged documents. What happened here?

In a somewhat gratuitous fashion, the court piled on by noting that not a single privileged document was withheld and that no privilege log was produced.

I confess that I am puzzled myself as to how this happened. Counsel went to the trouble of marking 159 documents privileged. Why in the seven months that followed did counsel not ask for a printout from the Kroll system sufficient to produce a simple privilege log? Alas, we case readers don’t get to ask these questions.

As the court concluded—ironically citing yet another case as precedent for the point:

It is axiomatic that a screening procedure that fails to detect confidential documents that are actually listed as privileged is patently inadequate.

Sorry Charlie. You lose on that one.

Failing to Notice and Rectify

The court didn’t stop there. It went on to fail the Village on the second element of the test. It held that the Village failed to rectify the error in a timely way. Again, ignorance in using the Kroll database seemed to be at the heart of the finding.

For starters, the court forgave the Village for taking another four months after the deposition to issue the privilege log. Methinks the court didn’t like that fact at all but just didn’t want to say so.

Instead, the court jumped on counsel for failing to find its own error for over nine months—the period from March to December 2009 when the production database was available to the plaintiffs. According to the court, defense counsel should have logged in and run a privilege search during this period. The 150+ hits that came back would have been a dead giveaway.

As the court said:

Yet for some nine months, the Village apparently had no inkling that the production database contained documents that the Village wished to withhold as privileged, or that [plaintiffs] were reviewing and obtaining those documents. If that is true (and we accept that it is), that means the Village was not paying any attention whatsoever to what documents its opponent in the litigation was selecting from the database. Perhaps [plaintiff] simply selected all of them; the parties’ briefs do not tell us if this is so. But, even if that were the case, a single visit to the production database could have alerted the Village to the problem.

This seems like piling on to me. Counsel clearly didn’t know much about databases or pay much attention to the process. The process was sloppy or non-existent. The client paid the price.

The court did go on to make one last point that brings us full circle. It noted that the problem might have come to light earlier had Village counsel provided a privilege log, which was its duty in the first place. Doing that might have forced plaintiff’s counsel to acknowledge what it knew—that the database had a bunch of privileged documents. But with no privilege log and a seeming statement that there were no documents being withheld on privilege grounds, all bets were off. Plaintiffs’ counsel could sit quietly until the deposition and then drop the bombshell on a hapless witness.

What Can We Learn From This?

This is a simple-enough case with a simple-enough message: Don’t produce documents in an online database without doing some basic checking first. Assuming, as the case does, that there was a field containing a privilege designation, it would take counsel milliseconds to realize that something bad was about to happen. If lead counsel wasn’t comfortable running the search, how about that tech-savvy associate or legal assistant? If not them, how about your friendly vendor? If asked, the Kroll people could have spotted the mistake. Some might say they should have but not me.

We built an automated production system that can be run by our clients with no Catalyst intervention. Rather than allow these kinds of mistakes to happen, we added a QC rule-set that will not allow a document marked “privileged” or “potentially privileged” to be produced without a specific override. Even if foldered for production, these documents are pulled out into a special folder that must be addressed by the client before a production can go through.

These rules are just another step in trying to make the process easier and more foolproof for lawyers who are not comfortable with technology. It doesn’t guarantee that a privileged document will never be produced—we have seen cases where the documents are marked privileged after they are produced—but it can cut down on mistakes. With the stakes (and the volumes) this high, you have to do everything you can to avoid an inadvertent waiver.
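A rule of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Catalyst's actual implementation: the `Document` class, the tag strings and the `override` flag are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical tag values; a real review platform will use its own field names.
BLOCKING_TAGS = {"privileged", "potentially privileged"}


@dataclass
class Document:
    doc_id: str
    tags: set = field(default_factory=set)
    override: bool = False  # explicit sign-off to produce despite a privilege tag


def split_production(docs):
    """Return (cleared, held). Held documents carry a blocking tag and no override."""
    cleared, held = [], []
    for doc in docs:
        flagged = {tag.lower() for tag in doc.tags} & BLOCKING_TAGS
        if flagged and not doc.override:
            held.append(doc)      # pulled aside for the client to address
        else:
            cleared.append(doc)   # safe to include in the production
    return cleared, held


docs = [
    Document("DOC-001", {"responsive"}),
    Document("DOC-002", {"responsive", "Privileged"}),
    Document("DOC-003", {"potentially privileged"}, override=True),
]
cleared, held = split_production(docs)
print("cleared:", [d.doc_id for d in cleared])
print("held:", [d.doc_id for d in held])
```

The point of the design is that the check runs on every production and fails closed: a tagged document stays out unless someone deliberately overrides it.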

This is not a case where counsel took dozens of steps to avoid privilege yet something slipped through, as I have written about before. (See, Bad Facts Make Bad Law: ‘Mt. Hawley’ A Step Backward for Rule 502(b).) Rather, it is about a simple mistake that anyone could have caught with just a smidgen of effort. I don’t feel as bad for Village counsel as I do for some of the other victims. This case is no “derelict on the waters of the law”; it is a fair ruling on somewhat extreme circumstances. Counsel were sleeping on the job (or so it seems to me) and paid the price.

So, the lesson here? Don’t produce those documents without checking to see if privileged files might have snuck through. Run some searches, sample some documents and for God’s sake check the privilege field. You will sleep better, and pay lower malpractice premiums, if you do.

 

‘Mt. Hawley’ Affirmed and Claim Dismissed: District Judge Again Puts His Stamp of Approval on Troubling Rulings

For over a year, we have been writing about a West Virginia decision (and its progeny) that we believe went too far in making new e-discovery law. The original decision, issued May 18, 2010, was styled Mt. Hawley Insurance Co. v. Felman Production. You can read my original post at: Bad Facts Make Bad Law: ‘Mt. Hawley’ A Step Backward for Rule 502(b).

In that decision, Magistrate Judge Mary E. Stanley held that Felman had waived attorney-client privilege by inadvertently producing a smoking-gun email to counsel suggesting that it might be helpful to their insurance claim for business interruption to backdate several orders from clients. If the orders had come in while the machinery in question was under repair, that might provide support for their $38 million insurance claim. You have to love their chutzpah at the very least.

The 'smoking-gun' email involved a furnace such as this one, at Felman's West Virginia facility.

In my original post, I suggested that bad facts (outright fraud it seemed to me) might be responsible for what I thought was bad law. After all, the production had been overseen by a highly reputable law firm (which had no involvement in this email). Counsel had not only been diligent in trying to screen out privileged documents, but it had gone far beyond what we have typically seen elsewhere. Indeed, counsel cited over 20 steps they had taken, including a variety of review and sampling efforts:

  1. Negotiated the ESI stipulation with defendants.
  2. Hired an ESI collection vendor, Innovative Discovery.
  3. Discussed with Felman’s IT department the company’s computer network structure and identified potential sources of relevant ESI.
  4. Visited Felman’s West Virginia plant to coordinate and oversee ESI collection.
  5. Decided to collect data using forensic imaging.
  6. Directed the vendor to collect ESI from the current server and the backup server.
  7. Collected 1,638 gigabytes of data.
  8. Downloaded emails from 29 custodians for processing by its law firm, Venable.
  9. Hired a new vendor to process Felman’s Oracle and Soloman databases.
  10. Identified the first six workstations to be processed and learned that each contained more data than anticipated.
  11. Examined methods to cull non-relevant materials.
  12. Selected search terms to retrieve documents responsive to defendants’ document requests.
  13. Tested the search terms against the Felman emails and added additional search terms.
  14. Tested the search terms, including the additional terms, against the Felman emails, tagged responsive documents, and set them aside for privilege review.
  15. Produced 17,064 Excel spreadsheets.
  16. Selected privilege search terms to identify materials which are potentially privileged and relevant.
  17. Set aside potentially privileged materials for individualized document-by-document review for relevancy and privilege.
  18. Tested the privilege search terms against Felman’s emails.
  19. Retrieved native files of all images and examined thumbnails.
  20. Conducted “eyes-on” review of all documents identified both as relevant and potentially privileged.
  21. Decided to use a vendor to complete the processing of Felman’s emails.
  22. Produced ESI in native or TIF format, with 36 fields of metadata.
  23. Produced more than 346 gigabytes of data without sampling for relevancy, over-inclusiveness or under-inclusiveness.

Counsel got nailed simply because several of the Concordance indexes they used turned out to be corrupt. As a result, privilege searches didn’t turn up anything for the documents in those indexes. Since counsel didn’t attempt to review every one of the millions of documents they produced, several key documents slipped through the net. Reading between the lines, Magistrate Judge Stanley seemed to lay blame on the fact that they did not appear to sample the documents that they produced but never reviewed to see if any might be privileged.

We wrote about subsequent decisions as well as other commentary about the decision here:

Now, in what has to be the final straw in this saga, the presiding judge in the case, U.S. District Judge Robert C. Chambers, has taken the ultimate step by issuing further sanctions and dismissing the lost-business claim: Felman Production v. Industrial Risk Insurers, 2011 U.S. Dist. LEXIS 112161 (Sept. 29, 2011).

Let us look at the court’s reasoning.

Bad Discovery Practices

U.S. District Judge Robert C. Chambers

You can’t read any of the Mt. Hawley decisions without being reminded that both the magistrate judge and the district judge were not happy with Felman’s discovery practices. Among others we have chronicled, you won’t win favor with the courts by backing up the trucks and dumping a ton of irrelevant electronic files on the opposition. When you go the next step and make every one of them confidential, regardless of content, you only worsen your position. The judge was particularly incensed to see pictures of kitties with a big confidential stamp on them. Kitties. Yes kitties. Awww, how cute. Oh, but there were a couple of naked men in the photos too (although not with the kitties, thank heavens).

There were also missing files and no real attempt to issue a litigation hold by Privat, the Ukrainian company that was at the controls in this case. Indeed, it appears that party Felman actively dissembled with respect to its true owners, a fun-loving bunch of Ukrainians with little respect for the discovery process. They got caught when one of the inadvertently produced documents showed that they were running the show. Oh what a tangled web they weaved!

As Judge Chambers explained:

Felman’s failure to comply with Judge Stanley’s August 19 and October 19, 2010 Orders was inevitable in light of the lack of care Felman exercised. … Felman did not provide litigation hold memos to the West Virginia Felman staff until four months after this case was filed. Felman also admitted that the Ukrainian custodians were not instructed to preserve their documents.

This led to the destruction of documents when the Privat representatives sold their computers before receiving a document request. Convenient, to say the least.

All this led to a motion for sanctions—either a dismissal outright or dismissal of the business interruption claims plus an adverse inference instruction regarding the missing documents.

The judge made short work of the motion. While not dismissing the case outright, he did dismiss the $38 million business interruption claim. He started by discussing spoliation, citing Magistrate Judge Grimm in Victor Stanley, Inc. v. Creative Pipe, Inc., 269 F.R.D. 497, 522 (D. Md. 2010).

A party subject to [the duty to preserve] must “identify, locate, and maintain information that is relevant to specific, predictable, and identifiable litigation.” In ascertaining whether a party has fulfilled its duty to preserve, a court must “determine reasonableness under the circumstances … [which] in turn depends on whether what was done—or not done—was proportional to that case and consistent with clearly established applicable standards.”

The court went on to find specific culpability—gross negligence—in Felman’s failure to issue litigation holds and otherwise take steps to preserve evidence.

In the end, the court didn’t dismiss all claims but rather threw out the big one—for business interruption. It left the other claims and counterclaims to be tried.

The sanctionable conduct of Felman and the resulting prejudice to Defendants merits dismissal of the business interruption claim because the unavailable evidence and improper conduct related predominantly to it. As to Defendant’s counterclaim, the unavailable evidence is less prejudicial and an adverse inference instruction is an adequate remedy.

The court also went on to award attorneys’ fees in the bargain.

What Happens Next?

My guess is that we won’t hear any more from the parties in this case. The business interruption claim was the heart of Felman’s demand in the first place. An adverse inference instruction is a powerful tool to address the rest of the claims, at least to the extent Felman oversteps the bounds of its insurance policy. Settlement is the likely next step in this matter.

What about a malpractice claim? Given the Felman party antics, I wouldn’t put it past them. They might claim malpractice for producing the damaging documents. With a bogus claim for $38 million at stake, who knows what they might do. At the least, this would present interesting evidentiary questions for the firm’s malpractice carrier. I wouldn’t want to be in the settlement meeting.

But my concern is for other cases more than for how this one wraps up. With 23 steps being ruled as not enough, what will be adequate? Perhaps the problem could have been remedied by some simple sampling procedures, but that isn’t clear from anything I read. Perhaps enough other judges will choose to ignore it so that it becomes weak precedent. “A derelict on the waters of the law,” as famed Supreme Court Justice Felix Frankfurter once said. That’s my vote. Send the bad guys home but leave e-discovery law alone.

New Model E-Discovery Order for Patent Cases Turns Fishing Expeditions into Games of ‘Go Fish’

Two weeks ago in his speech at the East Texas Judicial Conference, Chief Judge Randall R. Rader of the Federal Circuit Court of Appeals announced approval of An E-Discovery Model Order for patent cases. While not the first model order to hit the scene, this one could have far-reaching implications—not only for patent disputes but for other civil cases in federal and state courts.

As Judge Rader explained (and we all know), e-discovery in the digital age has gotten expensive. It turns out the situation is even worse for intellectual property disputes. One study suggested that this class of cases costs 62% more than the norm. At Catalyst, we host a lot of IP cases, so I don’t doubt those numbers one bit.

To help address this problem, the Advisory Council of the Federal Circuit created a special subcommittee to draft a model order governing e-discovery, which the Advisory Council then unanimously approved. One of the drafters’ goals was to limit the ability of litigants to turn discovery into an “unlimited fishing expedition.” In doing so, however, they may have gone too far. By limiting each party to five search terms per custodian, they have turned the search process into a game of Go Fish. At least that is the way it looks to me.

(For an earlier take on how to keep e-discovery from becoming like a child’s game, see Ralph Losey’s 2009 post, Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search.)

I support efforts to limit discovery costs and agree they have gotten out of hand. However, the purpose of discovery is to help the parties get at the truth. Turning the search process into a game of five questions would not seem to be a step in that direction. To the contrary, it ignores the most basic tenet of effective search: that it must be pursued in an interactive fashion. With the new Model Order, I fear the courts are turning the search part of discovery into more of a blind guessing game than a mechanism to find the documents that matter to your case.

What’s Right About the Model Order?

To be sure, there are a lot of good provisions in the Model Order, although I have questions about a few of them. Let me start with these and then turn to my concerns about the search limitations.

The order imposes cost shifting for disproportionate ESI production requests. It goes on to say that a party’s dilatory discovery tactics and, conversely, its efforts to promote efficiency and reduce costs will be considered in cost-shifting decisions. (Model Order Paragraphs 3 and 4.)

Intellectual Property Institute

Chief Judge Rader

This makes sense to me. The party demanding production is in the best position to judge whether the value of the information being requested outweighs the cost of producing it. However, somebody will have to determine when a discovery request is disproportionate.

General ESI productions will not include metadata other than date and time sent and received and distribution list. (Paragraph 5.)

I have some reservations about this one but generally agree. Outlook includes hundreds of metadata fields that are irrelevant in 99.9% of the cases. However, I am not sure how much this will save. The cost to extract a few fields is about the same as extracting all the fields. Still, it is mostly clutter so I like the idea. Simplicity is good in most cases.

General ESI production requests shall not include email. To obtain email, parties must propound specific email production requests. (Paragraph 6.)

This one surprises me. Email is at the heart of most disputes and ties heavily into prior art investigations. In Qualcomm, email not produced was at the heart of the sanctions hearings. Why is email singled out for special treatment here? I presume it is because of sheer volume.

Email production requests shall be propounded only for specific issues, rather than general discovery of a product or business. (Paragraph 7.)

This should prove interesting. I suspect a good lawyer will find a way to tie those broad email requests to “specific issues” which may render the limitation toothless.

Email production requests shall be phased to occur after the parties have exchanged initial disclosures and basic documentation about the patents, the prior art, the accused instrumentalities, and the relevant finances. (Paragraph 8.)

This is an interesting suggestion. I support efforts to keep discovery focused on key issues and I assume the goal is to keep initial productions to a minimum in the hopes that later email productions won’t be required.

Email production requests shall identify the custodian, search terms, and time frame. (Paragraph 9.)

This makes sense to me but it suggests that the parties will be able to define what search terms will bring back the relevant information. Without looking at the emails themselves, this will be difficult. Why not allow the parties to name specific individuals and review all of their communications during the relevant time frame? The chance of missing key discussions because you didn’t come up with the right search terms seems high to me.

Each requesting party shall limit its email production requests to a total of five custodians per producing party for all such requests. (Paragraph 10.)

When I was trying cases, I loved limits on depositions because they added a new tactical dimension to the game of preparing for trial. In a typical multi-party case, I would start by guessing which people the other parties would seek to depose and then try to maximize my reach by picking the ones that others didn’t pick. It was fun and didn’t really seem to get in the way of our search for justice.

Doing it for an email request seems both similar and different. In a multi-party case, will co-counsel get together and pick five different custodians each just to widen the net? Why impose such a limit? Why not make the parties justify their requests, perhaps setting a presumptive initial limit? Five seems like a small number to me.

The Model Order does say that should a party request email from more than five custodians, it shall bear the costs of the additional production. That sounds good but will doubtless spawn a host of follow-on debates about the reasonableness of the claimed costs.

Under FRE 502(d), the inadvertent production of a privileged document will not be deemed a waiver. Significantly, the party cannot use the inadvertently produced document to challenge the privilege or protection. (Paragraphs 12 and 13.)

This seems like a step in the right direction if it means the courts will take a harder line on waiver. In a number of recent decisions, the courts seem to have ignored the purpose of the recently amended FRE 502 and put counsel through a gauntlet of tests to justify retrieval of the inadvertently produced materials. See my comments on the Mt. Hawley decisions as one example.

In an effort to perhaps strengthen the safe harbor envisioned for FRE 502, the Order goes on to state:

The mere production of ESI in a litigation as part of a mass production shall not itself constitute a waiver for any purpose. (Paragraph 14.)

That will have some interesting consequences but should reduce some of the risk in producing ESI that hasn’t been reviewed multiple times before the production.

What’s Wrong with the Order

Here is my main concern with the Model Order. It states:

Each requesting party shall limit its email production requests to a total of five search terms per custodian per party. (Paragraph 11.)

Five search terms per custodian? For a major IP case? That number sure seems low to me, almost arbitrary. What are they trying to accomplish with this sort of limitation? To narrow down the number of documents produced in the case?

This approach might backfire by encouraging tactics that would broaden rather than narrow the results. If it were me, I would make my five search terms as broad as possible so as to bring back the proverbial kitchen sink. Without any other means to identify and isolate relevant documents, there is no other choice.

The five term restriction is even tougher because of the following additional limitation:

A disjunctive combination of multiple words or phrases (e.g., “computer” or “system”) broadens the search, and thus each word or phrase shall count as a separate search term unless they are variants of the same word.

This is troubling. Search experts routinely use different and varying words to describe a concept or event. They connect them with “or” statements.

Why? Because people use different and varying words when they communicate. Putting aside simple misspellings, search experts know that they have to account for two key concepts in language:

  • Polysemy: The fact that a word can have many meanings. “Strike,” for example, might signal a labor event or what happened to Alex Rodriguez in the bottom of the ninth to close the Detroit series.
  • Synonymy: The fact that many words can mean the same thing. We may file a case, a lawsuit, a claim, an action, a complaint, a petition and so forth.

Add to this the fact that we live in an age of texting and Twitter. People no longer feel the need to use the King’s English. We regularly clip our sentences with shortcuts. “LOL,” “How R U?” and “Gr8” are common in electronic communications.

That is one reason key word searches fail. It is too easy to miss one of the many possible combinations used by your targets. The only remedy is to pepper your searches with as many “or” clauses as possible.

Under the Model Order, that would no longer be allowed. If your first search has five “or” clauses and you can’t convince the court that these are “variants of the same word,” you will be done. One search (an incomplete one at that) and out.
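To see how quickly the counting rule bites, here is a rough Python sketch that tallies the terms in an "or" query. It rests on assumptions: the Model Order does not say how "variants of the same word" are to be identified, so the `crude_stem` helper below is a hypothetical stand-in that simply trims a few common English endings. A court or opposing counsel could well count differently.

```python
import re

TERM_LIMIT = 5  # five search terms per custodian under Paragraph 11


def crude_stem(word):
    """Very rough normalizer (an assumption for illustration, not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")


def count_search_terms(query):
    """Count the terms in an OR-joined query, treating crude variants as one term."""
    parts = re.split(r"\s+or\s+", query.strip(), flags=re.IGNORECASE)
    stems = {crude_stem(part.strip().strip('"')) for part in parts if part.strip()}
    return len(stems)


variants = "file OR files OR filed OR filing"
synonyms = "case OR lawsuit OR claim OR action OR complaint OR petition"

print(count_search_terms(variants))               # variants collapse to one term
print(count_search_terms(synonyms))               # each synonym counts separately
print(count_search_terms(synonyms) > TERM_LIMIT)  # the synonym list alone exceeds the limit
```

Under this reading, a single query built from the six synonyms for a lawsuit listed above would already use up more than a custodian's entire allowance.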

My second problem is that the proposed key word methodology seems flawed and almost backwards. Simple key word searching has been known to be ineffective for decades. Back in 1985, Blair and Maron conducted a study of the effectiveness of key word searches and found they only retrieved 20% of the relevant documents. See David C. Blair & M. E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 Comm. ACM 289 (1985).

During the TREC programs, researchers found that the key to making searches more effective was to interact with the documents. Some Lessons Learned To Date from the TREC Legal Track (2006-2009) (Feb. 24, 2010). Specifically, the researchers had to sample initial results from key word searches, refine their searches and sample again. Rinse and repeat techniques managed to bring results up to 50% effectiveness, which still isn’t all that good.

The most effective way to use key word searches is to combine them with different techniques. In one study, the author suggested that this approach could bring up the retrieval rate to about 78% of the total number of relevant documents. See: In Search of the Perfect Search.

Five search terms, with no sampling and limited “or” clauses? That doesn’t sound right to me. I think the courts should be going in the other direction. Encourage the parties to work together and sample the document population. Help them frame their searches so that they find the most relevant documents and only the most relevant.

Otherwise, we have let our fear of fishing expeditions convert the process to a game of Go Fish. “Give me all your eights,” counsel demands. “Go fish,”  opposing counsel replies.

For what it is worth, there is an “out” clause here. The Model Order states:

Should a party serve email production requests with search terms beyond the limits …, the requesting party shall bear all reasonable costs caused by such additional discovery.

What does this mean? If I want to submit a more effective search I can pay for it? Is this like buying a vowel on Wheel of Fortune? Perhaps it is a disguised cost-shifting mechanism.

The odd thing is this: For most systems, it doesn’t cost any more to run a long search with multiple “or” statements than a single term. After all, this is computer time we are talking about and not human effort. Catalyst allows searches up to 60,000 characters, for example, and that covers a lot more than five terms. Why not allow counsel to build at least five comprehensive searches before looking at cost shifting?

There is also another out. Under the Model Order, the parties may jointly agree to modify the five-term limit without having to ask the court’s permission. The key is the requirement that the parties “jointly agree.” In my experience, one side always has an interest in being less helpful so reaching agreement may prove difficult in many cases. However, it may be that counsel will simply make expanding the number of searches allowed a standard part of their pretrial scheduling order.

Even with those outs, the default under the Model Order is to force counsel to choose between games. Wheel of Fortune or Go Fish? These are not my preferred modes for modern e-discovery. It will be interesting to see how the Model Order works out in practice.

Shedding Light on an E-Discovery Mystery: How Many Documents in a Gigabyte?

In his article, “Accounting for the Costs of Electronic Discovery,” David Degnan states that conducting electronic discovery “may cost upwards of $30,000 per gigabyte.” (You can read Bob Ambrogi’s post about it here.) That is a lot of money for discovery, particularly considering that the number of gigabytes we are seeing per case seems to keep increasing.

Much of Degnan’s analysis turns on how many documents (files actually) you can expect to find in a gigabyte of data. As he points out, review costs make up nearly 60% of the total costs for e-discovery. If a reviewer can only get through an average of 50 documents per hour (as Degnan suggests), the number of documents likely to be found per gigabyte of data becomes important to understanding the costs of electronic discovery.

Degnan’s View

So, how many docs are there in a gig? In Table 3 of his article, Degnan provides the following range of figures:

Degnan cites the Clearwell cost calculator for the proposition that the range goes from 5,000 to 25,000 documents per GB. He also refers to an article by Chris Eagan and Glen Homer for the proposition that 10,000 documents per GB is the industry standard.

Neither estimate is backed by hard data but we have seen these kinds of ranges before. The EDRM group, for example, posted the following as “industry averages” for images and files per GB.

No basis for these ranges is given.

Dutton’s View

E-discovery consultant Cliff Dutton included this question in a survey he submitted to a number of law firms, corporations, consultants and software providers. Specifically, he asked: “What is the average number of documents (post culling) per GB collected from all sources?” See: eOPS 2010: Electronic Discovery Operational Parameters Survey (PDF posted with author’s permission).

Dutton’s survey resulted in figures that would support the low end of Degnan’s range. The mean (average) response to his survey was 5,244 documents per gigabyte. The median response was only a bit higher at 5,500 documents per gigabyte.

That leaves us with a pretty broad range. Dutton’s figure of 5,200 is only half the so-called industry standard of 10,000 documents per gigabyte. In turn, that figure is less than half of the upper limit of 25,000 documents per gigabyte suggested by Degnan.
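The spread matters because review time scales directly with document count. As a back-of-the-envelope sketch, the Python below converts each estimate into review hours per gigabyte using the 50 documents per hour pace that Degnan suggests. The function name is made up for the example, and no hourly rate is assumed, so the output is hours rather than dollars.

```python
DOCS_PER_HOUR = 50  # review pace suggested by Degnan

# Documents-per-gigabyte estimates discussed above.
estimates = {
    "Dutton survey mean": 5_244,
    "so-called industry standard": 10_000,
    "upper end of Degnan's range": 25_000,
}


def review_hours_per_gb(docs_per_gb, docs_per_hour=DOCS_PER_HOUR):
    """Hours one reviewer needs to get through a gigabyte at the given pace."""
    return docs_per_gb / docs_per_hour


for label, docs in estimates.items():
    hours = review_hours_per_gb(docs)
    print(f"{label}: {docs:,} docs/GB -> {hours:,.0f} review hours per GB")
```

At that pace, the estimates work out to roughly 105, 200 and 500 reviewer hours per gigabyte, nearly a fivefold difference depending on which figure you believe.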

So what’s the right number?

Catalyst Data

I have been interested in this question for quite some time but never did anything to pin down the answer. When pricing is discussed, clients are naturally interested in knowing how many documents to expect from their collection efforts. Since collections are typically measured in gigabytes, the “How many documents?” question comes up all the time. So, I decided to take a look at our own data.

We started by taking data from a handful of our cases. This survey was not meant to be scientific but we did look for cases with different types of data. For example, one of our clients sends us a lot of native PDF files. These are not scanned images but rather postscript files and thus their sizes tend to be small. Other cases involved Outlook, Word, Excel and the other Office-type files that you expect from e-discovery. We expect some variance among cases that hold different types of data.

We grabbed our initial data from nine cases, chosen pretty much at random. In total, the cases had just under 8 million files with a total of 1.6 terabytes of storage. Of course, I am not using real case names so I’ve labeled them cases 1 through 9.

I should also note that, following the international standards bodies including NIST, we used 1,000 rather than 1,024 as the standard for converting bytes to gigabytes. Thus, we count 1 billion bytes as a gigabyte rather than 1,073,741,824 bytes. You can read about this debate here. If you prefer to use 1,024 as your multiplier, the difference is about 7 percent. Thus, 5,000 files per GB using the 1,000 multiplier would be about 5,370 files per GB based on a divisor of 1,024 (bytes/(1024 x 1024 x 1024)).
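
For anyone who wants to check the conversion, here is a small sketch. The two gigabyte definitions come from the paragraph above; the function names are assumptions used only for illustration.

```python
DECIMAL_GB = 1000 ** 3  # 1,000,000,000 bytes, the definition used in this post
BINARY_GB = 1024 ** 3   # 1,073,741,824 bytes


def files_per_gb(file_count, total_bytes, bytes_per_gb=DECIMAL_GB):
    """Files per gigabyte for a collection of total_bytes bytes."""
    return file_count / (total_bytes / bytes_per_gb)


def decimal_to_binary_rate(files_per_decimal_gb):
    """Convert a files-per-GB rate from 1,000-based to 1,024-based gigabytes."""
    return files_per_decimal_gb * BINARY_GB / DECIMAL_GB


print(f"difference: {BINARY_GB / DECIMAL_GB - 1:.1%}")
print(f"5,000 files per decimal GB = {decimal_to_binary_rate(5000):,.0f} per binary GB")
```

The exact result is 5,368.7 files per binary gigabyte, in line with the rounded figure of about 5,370 given above.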

With all that behind us, here is what we found (with the cases sorted based on files per GB):

This seemed interesting. The bottom line average (total files divided by total GBs) of 4,890 was on par with the 5,000 figure at the low end of the “accepted” range reported by Degnan and nearly matched the 5,244 average found by Dutton’s survey. Our median was just a bit lower at 4,522. For those, like me, who almost failed statistics, the median value says that half the cases had more than 4,522 files per GB and half had fewer.
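
The two summary figures are computed differently, which is easy to overlook. The sketch below shows the distinction: the average pools all files and all gigabytes, while the median is taken over the per-case rates. The three cases in it are made up for illustration and are not the Catalyst data, and the function names are assumptions.

```python
from statistics import median

# Hypothetical (file_count, size_in_gb) pairs; NOT the case data from this post.
cases = [(150_000, 100.0), (900_000, 200.0), (700_000, 50.0)]


def overall_files_per_gb(cases):
    """Pooled average: total files divided by total gigabytes."""
    total_files = sum(files for files, _ in cases)
    total_gb = sum(gb for _, gb in cases)
    return total_files / total_gb


def median_files_per_gb(cases):
    """Median of the individual per-case files-per-GB rates."""
    return median(files / gb for files, gb in cases)


print(f"pooled average: {overall_files_per_gb(cases):,.0f} files per GB")
print(f"median of cases: {median_files_per_gb(cases):,.0f} files per GB")
```

With these made-up inputs the pooled average is 5,000 and the median is 4,500. Large cases pull the pooled average toward their own rate, while the median treats every case equally.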

I found the disparity between the number of files per GB in the different cases interesting as well. Some cases had very low counts—1,500 files per GB—while others stretched as high as 14,000 files per GB. You can see the variation here:

Because Case 9 was the largest outlier, I took a look at the specific files stored there to see what I could learn about how file types played into these calculations.

This case had a lot of GIFs—possibly logos on email messages that were treated as separate attachments. It also had a lot of JPG files, which also could be logos. Should these be included in our calculations? I could argue the point either way but it does tell us that the composition of the files is an important factor.

The other types of files included on this site had high files-per-GB counts as well. PDFs led the way with 21,000 files per GB but counts for the Word documents also were high with 12,000 files per GB. I suppose that is what makes it an outlier.

Case 8, the other site with a high file-per-GB count, was a PDF site. The client sent us postscript PDF files (native rather than a scanned image). You can see that the file sizes were rather small:

I was not surprised to see that the text files would have a high file-per-GB rate. The postscript PDF files, which were largely emails, were also relatively small.

Case 1 was at the other end of the spectrum. The file distribution looked like this:

Across the board, the files per GB are dramatically lower. Short of analyzing the content of individual files (or simply recognizing that file sizes vary with content), I don’t have an explanation for the variances. I include it simply because it is interesting.

Here is a summary by file type across all nine sites:

Naturally, these numbers tie into the averages presented earlier. But this analysis also shows that the numbers can fall well below the 5,000 files per GB that Degnan suggested as the low end of the range. In particular, Excel and PowerPoint files average far lower than other file types.

Enlarging the Survey

With my curiosity piqued, I asked our team to look at additional cases. This time, we looked at 20 more cases of different sizes, again chosen pretty much at random. In total, the cases had just over 10 million files with a total of 3.8 terabytes of storage. Here is what we found:

This time, the values were quite a bit lower than in the first sample. The average number of files per GB (total files divided by total GBs) was about half of the lowest figure in the accepted range reported by both Degnan and Dutton. The median was a bit lower at 2,421.

Once again, we saw a range of values across the cases. You can see the variation here (you can also see that we presented this batch of cases in order):

Since it was relatively easy to do, I combined all of the data from the 29 cases to see what that would show me. Here is what I found:

Needless to say, these figures are quite a bit below the ones others have been throwing around. If the true average is closer to 3,300 files per GB, the industry standards will certainly need adjusting.

Normalizing the Data

Given that we have some statistics expertise in our Search and Analytics Consulting Group, I thought I would do a little more analysis of the data we had uncovered. We started by ordering the data by size, from the lowest files per GB to the highest. Here is how it sorted out:

We then decided to remove outliers from the sample as indicated in the chart. This is a common practice as statisticians review sample results. Figures that are way low or way high might throw your analysis off and provide misleading data.

You can see the ones we removed (and agree or disagree with the practice). Removing them changed the figures to an average of 2,544 files per GB with a median of just a bit less, 2,296.
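
The post does not state the rule used to pick the outliers, so the sketch below swaps in a common one, the interquartile-range fences usually attributed to John Tukey. It is an assumption for illustration rather than the method applied here, and the sample rates are made up.

```python
from statistics import mean, median, quantiles


def drop_outliers_iqr(values, k=1.5):
    """Keep values within k interquartile ranges of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical files-per-GB rates for seven cases; NOT the data from this post.
rates = [1500, 2000, 2200, 2400, 2600, 3000, 14000]
kept = drop_outliers_iqr(rates)

print(f"before: mean {mean(rates):,.0f}, median {median(rates):,.0f}")
print(f"after:  mean {mean(kept):,.0f}, median {median(kept):,.0f}")
```

In this made-up sample only the 14,000 value is dropped. The mean moves a great deal while the median barely changes, which is the usual effect of trimming a single extreme case.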

We also calculated the standard deviation to be 1,258 documents per GB. I am no expert on this but a standard deviation measures how widely the values in a data set are spread around their mean, and it is closely tied to the famed bell curve (the normal distribution). You can read more about standard deviation from Wikipedia. It is often denoted by the sigma character “σ” and is at the heart of the Six Sigma programs for reducing manufacturing defects.

The goal with a standard deviation analysis is to try to get a handle on how much variance you might expect when you survey future document populations. If your document population is fairly homogenous and follows a normal distribution, you can expect that about 68% of the larger document population will fall within plus or minus one standard deviation of your mean (average). Within two standard deviations, you can expect about 95% of your documents to fall.
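
As a rough illustration, the sketch below turns the mean and standard deviation reported above (2,544 and 1,258 files per GB) into one-sigma and two-sigma ranges, and computes the share of a normal distribution that falls inside each. It assumes the data really are normal, which has not been established here, and the function name is an assumption.

```python
import math

MEAN = 2544   # average files per GB after removing outliers (from this post)
SIGMA = 1258  # standard deviation reported in this post


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2):
    low, high = MEAN - k * SIGMA, MEAN + k * SIGMA
    print(f"within {k} sigma: {low:,} to {high:,} files per GB "
          f"(about {normal_coverage(k):.1%} of cases, if normal)")
```

For an exactly normal distribution, the shares work out to about 68.3 percent within one standard deviation and about 95.4 percent within two.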

I am not in a position to say that the mean we calculated is representative or that e-discovery document populations are sufficiently homogenous to follow a normal distribution. However, we can take the data we obtained and fit it to a normal distribution curve. It looks like this:

This graph suggests that 96 percent of the population will fall well under 5,000 documents per GB with only the extreme outliers going beyond it.
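
A fitted curve like this can also be queried directly. The sketch below uses Python's standard `statistics.NormalDist` with the mean and standard deviation reported above to estimate the share of cases below a given threshold. It inherits the same caveat, that a normal fit is an assumption, and the function name and thresholds are chosen only for illustration.

```python
from statistics import NormalDist

# Normal curve using the mean and standard deviation reported in this post.
fitted = NormalDist(mu=2544, sigma=1258)


def share_below(threshold, dist=fitted):
    """Estimated fraction of cases with fewer than threshold files per GB."""
    return dist.cdf(threshold)


for threshold in (5_000, 10_000):
    print(f"under {threshold:,} files per GB: {share_below(threshold):.1%}")
```

The exact percentages depend entirely on the fit, so they are best read as rough guides rather than predictions.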

So How Many Docs in a Gig?

What can we make of these surveys? They certainly suggest that the average number of files per GB is well under the 10,000 figure cited as the “industry standard,” let alone the higher numbers. Our belief is that the true figure is well below even the 5,000 files per GB that Cliff Dutton reported. If so, that could impact a lot of e-discovery estimates.

I am not a statistician and the statistics professor we work with is in France on sabbatical. Nonetheless, we did look at over 18 million e-discovery files that totaled over five terabytes of storage. That is a pretty solid number from which to draw conclusions.

At the same time, let me be clear that the numbers will vary dramatically depending on the types of documents you have. With simple text files, which are rare in native e-discovery populations, the numbers could skyrocket. With postscript PDF files (converted email or short Word files for example), we saw values go well above the 10,000 mark. In contrast, with some of the other file types, PowerPoint and Excel for example, your numbers could go well below our calculated mean of 2,500 documents per GB.

Perhaps others have done more sophisticated analyses that they can share. I offer our figures just to get the discussion started. Let me know what your experience has been with your files.

I want to thank my fellow Catalyst employees, Greg Berka, Kevin Hughes and Nirupama Bhatt, for their help on this article.

Predictive Coding: One Grumpy Old Competitor Speaks Up

Last week, Law Technology News reporter Evan Koblentz called me to ask about a new patent issued to Recommind for a method of “predictive coding.” At the time, I had only glanced at the patent and told the reporter that I was in no position to comment on its substantive claims over the phone. I did wish Recommind well with its patent and its business—just as I would with respect to other competitors.

As background, I also explained that getting a patent awarded was not the end of the process. Rather, to enforce the patent, one has to meet a number of additional challenges, including proving that the patented device or process was new and innovative. A patent based on works or ideas already in circulation, often referred to as “prior art,” is subject to challenge and revocation.

In response to further questions, I told the reporter that we were “puzzled” as to how a company could get a patent involving a process that had been around academia for more than 40 years. Before the call, I had spoken with Dr. Jeremy Pickens, our Senior Applied Research Scientist, to ask him for his thoughts on the patent. Prior to joining Catalyst, Jeremy’s research at the FX Palo Alto Lab led to six patents in the field of search and information retrieval, including two for collaborative exploratory search systems.

Jeremy had taken a quick look at the patent and wondered how it got through. You can read his comments about prior research and the state of the industry.

I had to laugh when, after the LTN article came out last week (Recommind Intends to Flex Predictive Coding Muscles), my comments were interpreted as “grumpy” by the good folks at Above the Law.

I found myself smiling because I have been called a lot of things but never grumpy. And, other than being an interested observer, I didn’t feel happy or unhappy about the Recommind patent. As I said to Evan Koblentz, I wish the Recommind people well with their patent and their business—they are doing a lot of exciting things in the industry and deserve their success.

But, for the record (as I used to say when I was a lawyer), the concepts and processes underlying predictive coding are not new. Perhaps Recommind has added a new wrinkle to the process but not much more than that, so far as we can see.

Who Invented ‘Predictive Coding’ Anyway?

The phrase “predictive coding” isn’t new in the industry and was not coined by Recommind. Even before Recommind filed the application for its patent, the Bank of America had already filed an application for a patent on “Predictive Coding of Documents in an Electronic Discovery System” (with the provisional application filed on March 27, 2009). Others have used the term for a variety of processes as well. For examples, just Google the phrase.

Although Recommind tried to bull its way through a trademark for the term, the effort failed. As Evan Koblentz later reported on the ALM blog EDD Update, the government rejected Recommind’s attempt to trademark a phrase that was descriptive and already in use by others. (Ironically, the same government agency that granted the patent rejected the trademark.)

Goodbye trademark.

More to the point, the techniques behind predictive coding aren’t new. As Dr. Pickens points out in his post, they go back to the 1970s when search scientists introduced the concept of “relevance feedback” into the lexicon. They realized the simple truth that computer-based search algorithms could be made more effective through an iterative process involving human feedback. And, that work has continued to evolve over the past 40 years.
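
For readers who have not seen relevance feedback in action, here is a minimal sketch of one classic textbook formulation, the Rocchio update from the early 1970s. It is offered only as an illustration of the general idea; it is not Recommind's patented method or Catalyst's own approach. The term weights, the sample documents and the alpha, beta and gamma values (commonly cited defaults) are all assumptions.

```python
from collections import defaultdict


def rocchio_update(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new query vector (term -> weight) after one round of feedback."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Pull the query toward documents a human reviewer marked relevant.
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant_docs)
    # Push it away from documents marked not relevant.
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms whose weight is zero or negative are dropped.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical one-term query and reviewer judgments.
query = {"merger": 1.0}
relevant = [{"merger": 1.0, "acquisition": 2.0}]
nonrelevant = [{"merger": 1.0, "lunch": 3.0}]
print(rocchio_update(query, relevant, nonrelevant))
```

Each pass folds the reviewer's judgments back into the query, so the next search leans toward terms found in relevant documents and away from terms found in the others. Repeating the cycle is the iterative loop described above.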

So, we remain puzzled as to how Recommind could claim a patent around the work of so many others.

The inventor of the Internet.

Ultimately, we are observers in this process because we use different math and techniques than Recommind and many of the others in the market. Specifically, we were one of the first to use a more modern set of algorithms, called non-negative matrix factorization, to analyze document themes and similarities.

This technique, developed at the Massachusetts Institute of Technology, is used widely for facial recognition as well as text analysis. We work closely with Dr. Michael Berry from the University of Tennessee’s Center for Intelligent Systems and Machine Learning, who is a leading proponent of the technique for mathematical search analysis.

Not Grumpy Here

So, we at Catalyst are certainly not grumpy—life’s good here. Nor are we unhappy about Recommind’s patent. Like other bystanders, we will enjoy watching Recommind try to enforce its patent against whomever they suspect might be using similar techniques. I am sure it will make for a great show, perhaps even worthy of truTV.

For the record, we didn’t invent predictive coding or the techniques around relevance feedback. Nor did Recommind. Check with Al Gore. Maybe he did.

 

NIST Issues Draft Recommendations on Cloud Computing

Earlier this month, the Computer Security Division of the National Institute of Standards and Technology (NIST) issued draft recommendations on cloud computing (PDF). As many of you know, NIST is an agency of the U.S. Department of Commerce. Founded in 1901, the agency was the nation’s first physical science research laboratory.

In the e-discovery field, we know it better for its list of 65 million hash values of system and program files (the “NIST” list). We use the list to remove unwanted files before we process documents and other data. The NIST list is the gold standard for our industry and we use it every day.
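
The filtering step itself is simple to sketch. The example below hashes each file and drops any whose hash appears in a set of known system-file hashes. It assumes the list is available as a set of SHA-1 hex digests; the function names are hypothetical and this is not a description of our production pipeline.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_known_files(paths, known_hashes):
    """Return the paths whose SHA-1 hash is NOT in the known-file set."""
    # Normalize case so upper- and lower-case hex digests compare equal.
    known = {h.lower() for h in known_hashes}
    return [path for path in paths if sha1_of_file(path) not in known]
```

Calling `remove_known_files(collected_paths, known_hashes)` leaves only the files that still need processing. Reading in chunks keeps memory use flat even for very large files.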

NIST is involved in many other areas of inquiry, including the International System of Units (as discussed in my recent post, How Many Bytes in a Gigabyte? My Answer Might Surprise You). It also recently issued draft guidelines on security and privacy in cloud computing and launched the NIST Cloud Computing Collaboration wiki to encourage collaboration in refining its cloud standards.

What is Cloud Computing?

In the 84-page draft, Cloud Computing Synopsis and Recommendations, published May 12, the NIST team set out to write a primer on the cloud—types, deployment models, service models, cloud security and, ultimately, the benefits of cloud computing. They start with NIST’s definition of cloud computing, which is tricky because:

Cloud computing is not a single kind of system, but instead spans a spectrum of underlying technologies, configuration possibilities, service models, and deployment models.

Thus, while the term “cloud” is often used as a synonym for the Internet, cloud computing means more than simply the transmission of data over the Internet.

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

According to the NIST definition, cloud computing has five essential characteristics:

  • On-demand self service.
  • Broad network access.
  • Resource pooling.
  • Rapid elasticity.
  • Measured service.

Following this logic, one could argue either way for many of the e-discovery providers who bill themselves as cloud providers. While they may offer a hosted product via the Internet, they may not meet NIST’s requirements for on-demand self service, resource pooling and rapid elasticity.

There are several service models for cloud computing, each with different strengths and weaknesses:

  1. Cloud Software as a Service (SaaS): Cloud e-discovery providers would fall under this category. They offer a product accessible via a browser but manage the underlying infrastructure including network, servers, operating system, storage and applications.
  2. Cloud Platform as a Service (PaaS): This allows consumers to deploy their applications on top of a cloud infrastructure.
  3. Cloud Infrastructure as a Service (IaaS): Consumers essentially rent the infrastructure but determine their own software and even the OS they will use.

NIST's depiction of how control is shared in a SaaS model.

There are also four different deployment models for cloud computing:

  1. Private cloud: This refers to infrastructure that is operated solely for one organization. It may be managed by a third party but is dedicated to that purpose.
  2. Community cloud: In this case, a group of users provision a cloud infrastructure for a common purpose.
  3. Public cloud: Here, the infrastructure is made available to the general public, although owned by the organization selling the service.
  4. Hybrid cloud: This would be a combination of two or more clouds (private, community or public) that are connected by technology that allows data or application portability.

Why Read the Guidelines

If you are considering the cloud for any of your applications, this is a helpful document. The authors discuss operational characteristics, standards for service-level agreements and security considerations. Ultimately, they talk about the benefits of cloud computing and why organizations like law firms and corporations might consider it.

Cloud computing is relatively new to the legal community, as it is to the rest of the business world. Why use it? Here is the NIST view:

In outsourced and public deployment models, cloud computing provides convenient rental of computing resources: users pay service charges while using a service but need not pay large up-front acquisition costs to build a computing infrastructure. … By using an elastic cloud, customers may be able to avoid excessive costs from overprovisioning, i.e., building enough capacity for peak demand and then not using the capacity in non-peak periods.
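
The overprovisioning point is easy to see with a little arithmetic. The sketch below compares paying for peak capacity around the clock with paying only for what is used each hour. The demand figures and the unit price are made up, the function names are assumptions, and the model ignores the fact that rented and owned capacity rarely cost the same per unit.

```python
def fixed_capacity_cost(hourly_demand, cost_per_unit_hour):
    """Cost of keeping enough capacity for peak demand in every hour."""
    return max(hourly_demand) * len(hourly_demand) * cost_per_unit_hour


def elastic_cost(hourly_demand, cost_per_unit_hour):
    """Cost of paying only for the capacity actually used in each hour."""
    return sum(hourly_demand) * cost_per_unit_hour


# Hypothetical demand in server units over four hours, with one spike.
demand = [2, 2, 10, 2]
print(f"built for peak: ${fixed_capacity_cost(demand, 1.0):.2f}")
print(f"elastic:        ${elastic_cost(demand, 1.0):.2f}")
```

With these made-up numbers the peak-sized setup costs 40 units against 16 for the elastic one. The gap widens as demand gets spikier.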

Earlier this year, we dumped our Exchange servers in favor of Gmail (via Google Apps). There was some grumbling at first but the transition was a success. The service has worked as well as Exchange, the product is continually updated and we don’t have to worry about hardware or software upgrades. Although email is critical to our business, it isn’t one of our core services. So why run it ourselves? Turns out we don’t need to and we get the added benefit of Google Docs, Google Calendar and other features.

Is it right for you? I would give it a good look the next time you think about upgrading or switching providers. It is the way the computing world seems to be going.

As for NIST’s draft guide to cloud computing, the agency is seeking comments from the public. The U.S. government’s CIO has asked NIST to lead federal efforts on developing standards for data portability, cloud interoperability and security. The goal, according to NIST, “is to help the federal government reap the benefits of cloud computing.” Comments must be submitted by June 13.