Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Shedding Light on an E-Discovery Mystery: How Many Documents in a Gigabyte?

In his article, “Accounting for the Costs of Electronic Discovery,” David Degnan states that conducting electronic discovery “may cost upwards of $30,000 per gigabyte.” (You can read Bob Ambrogi’s post about it here.) That is a lot of money for discovery, particularly considering that the number of gigabytes we are seeing per case seems to keep increasing.

Much of Degnan’s analysis turns on how many documents (files actually) you can expect to find in a gigabyte of data. As he points out, review costs make up nearly 60% of the total costs for e-discovery. If a reviewer can only get through an average of 50 documents per hour (as Degnan suggests), the number of documents likely to be found per gigabyte of data becomes important to understanding the costs of electronic discovery.
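To make the link between documents per gigabyte and review cost concrete, here is a minimal sketch of the arithmetic. The 50 documents per hour is Degnan's figure cited above; the function names and the hourly rate in the example are hypothetical and only there for illustration.

```python
def review_hours_per_gb(docs_per_gb: float, docs_per_hour: float = 50.0) -> float:
    """Reviewer hours implied by one gigabyte of data."""
    if docs_per_hour <= 0:
        raise ValueError("docs_per_hour must be positive")
    return docs_per_gb / docs_per_hour


def review_cost_per_gb(docs_per_gb: float, hourly_rate: float,
                       docs_per_hour: float = 50.0) -> float:
    """Review cost per gigabyte; hourly_rate is whatever you pay reviewers."""
    return review_hours_per_gb(docs_per_gb, docs_per_hour) * hourly_rate


if __name__ == "__main__":
    assumed_rate = 60.0  # hypothetical hourly rate, not taken from the article
    for docs_per_gb in (5_000, 10_000, 25_000):
        hours = review_hours_per_gb(docs_per_gb)
        cost = review_cost_per_gb(docs_per_gb, assumed_rate)
        print(f"{docs_per_gb:>6} docs/GB: {hours:,.0f} hours, ${cost:,.0f} per GB")
```

At 50 documents per hour, 5,000 documents per GB means 100 reviewer hours per gigabyte, while 25,000 documents per GB means 500 hours. The estimate scales directly with the documents-per-gigabyte assumption.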

Degnan’s View

So, how many docs are there in a gig? In Table 3 of his article, Degnan provides the following range of figures:

Degnan cites the Clearwell cost calculator for the proposition that the range goes from 5,000 to 25,000 documents per GB. He also refers to an article by Chris Eagan and Glen Homer for the proposition that 10,000 documents per GB is the industry standard.

Neither estimate is backed by hard data but we have seen these kinds of ranges before. The EDRM group, for example, posted the following as “industry averages” for images and files per GB.

No basis for these ranges is given.

Dutton’s View

E-discovery consultant Cliff Dutton included this question in a survey he submitted to a number of law firms, corporations, consultants and software providers. Specifically, he asked: “What is the average number of documents (post culling) per GB collected from all sources?” See: eOPS 2010: Electronic Discovery Operational Parameters Survey (PDF posted with author’s permission).

Dutton’s survey resulted in figures that would support the low end of Degnan’s range. The mean (average) response to his survey was 5,244 documents per gigabyte. The median response was only a bit higher at 5,500 documents per gigabyte.

That leaves us with a pretty broad range. Dutton’s figure of 5,200 is only half the so-called industry standard of 10,000 documents per gigabyte. In turn, that figure is less than half of the upper limit of 25,000 documents per gigabyte suggested by Degnan.

So what’s the right number?

Catalyst Data

I have been interested in this question for quite some time but never did anything to pin down the answer. When pricing is discussed, clients are naturally interested in knowing how many documents to expect from their collection efforts. Since collections are typically measured in gigabytes, the “How many documents?” question comes up all the time. So, I decided to take a look at our own data.

We started by taking data from a handful of our cases. This survey was not meant to be scientific but we did look for cases with different types of data. For example, one of our clients sends us a lot of native PDF files. These are not scanned images but rather postscript files and thus their sizes tend to be small. Other cases involved Outlook, Word, Excel and the other Office-type files that you expect from e-discovery. We expect some variance among cases that hold different types of data.

We grabbed our initial data from nine cases, chosen pretty much at random. In total, the cases had just under 8 million files with a total of 1.6 terabytes of storage. Of course, I am not using real case names so I’ve labeled them cases 1 through 9.

I should also note that, following the international standards bodies including NIST, we used 1,000 rather than 1,024 as the standard for converting bytes to gigabytes. Thus, we count 1 billion bytes as a gigabyte rather than 1,073,741,824 bytes. You can read about this debate here. If you prefer to use 1,024 as your multiplier, the difference is about 7 percent. Thus, 5,000 files per GB using the 1,000 multiplier would be about 5,370 files per GB based on a divisor of 1,024 (bytes / (1,024 x 1,024 x 1,024)).
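The conversion between the two gigabyte definitions can be sketched as follows. The function names are illustrative and not part of any tool we use.

```python
DECIMAL_GB = 1000 ** 3  # 1,000,000,000 bytes (the convention used in this post)
BINARY_GB = 1024 ** 3   # 1,073,741,824 bytes


def files_per_gb(file_count: int, total_bytes: int,
                 bytes_per_gb: int = DECIMAL_GB) -> float:
    """Files per gigabyte under the chosen gigabyte definition."""
    return file_count / (total_bytes / bytes_per_gb)


def decimal_to_binary_rate(files_per_decimal_gb: float) -> float:
    """Convert a files-per-GB rate from 1,000-based to 1,024-based gigabytes."""
    return files_per_decimal_gb * BINARY_GB / DECIMAL_GB


if __name__ == "__main__":
    print(round(decimal_to_binary_rate(5_000)))  # rounds to 5369
```

A binary gigabyte holds about 7.4 percent more bytes, so the same collection shows proportionally more files per gigabyte when measured that way.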

With all that behind us, here is what we found (with the cases sorted based on files per GB):

This seemed interesting. The bottom line average (total files divided by total GBs) of 4,890 was on par with the 5,000 lowest end of the “accepted” range reported by Degnan and nearly matched the 5,244 average found by Dutton’s survey. Our median was just a bit lower at 4,522. For those, like me, who almost failed statistics, the median value says that half the cases had more than 4,522 files per GB and half had fewer.
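For readers who want to see how the two summary figures differ, here is a small sketch. The three cases are made-up numbers for illustration only and are not our case data.

```python
from statistics import median

# Hypothetical (file_count, gigabytes) pairs; NOT the cases from this post.
cases = [
    (150_000, 100.0),  # 1,500 files per GB
    (400_000, 100.0),  # 4,000 files per GB
    (700_000, 50.0),   # 14,000 files per GB
]


def pooled_files_per_gb(cases):
    """Total files divided by total gigabytes (the 'bottom line' average)."""
    total_files = sum(files for files, _ in cases)
    total_gb = sum(gb for _, gb in cases)
    return total_files / total_gb


def median_files_per_gb(cases):
    """Median of the per-case rates; every case counts equally."""
    return median(files / gb for files, gb in cases)


if __name__ == "__main__":
    print(pooled_files_per_gb(cases))  # 5000.0
    print(median_files_per_gb(cases))  # 4000.0
```

The pooled average weights each case by its size, so large cases dominate it. The median ignores size and is less sensitive to a single extreme case.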

I found the disparity between the number of files per GB in the different cases interesting as well. Some cases had very low counts—1,500 files per GB—while others stretched as high as 14,000 files per GB. You can see the variation here:

Because Case 9 was the largest outlier, I took a look at the specific files stored there to see what I could learn about how file types played into these calculations.

This case had a lot of GIFs—possibly logos on email messages that were treated as separate attachments. It also had a lot of JPG files, which also could be logos. Should these be included in our calculations? I could argue the point either way but it does tell us that the composition of the files is an important factor.

The other types of files included on this site had high files-per-GB counts as well. PDFs led the way with 21,000 files per GB but counts for the Word documents also were high with 12,000 files per GB. I suppose that is what makes it an outlier.

Case 8, the other site with a high file-per-GB count, was a PDF site. The client sent us postscript PDF files (native rather than a scanned image). You can see that the file sizes were rather small:

I was not surprised to see that the text files would have a high file-per-GB rate. The postscript PDF files, which were largely emails, were also relatively small.

Case 1 was at the other end of the spectrum. The file distribution looked like this:

Across the board, the files per GB are dramatically lower. Short of analyzing the content of individual files (or simply recognizing that file sizes vary with content), I don’t have an explanation for the variances. I include it simply because it is interesting.

Here is a summary by file type across all nine sites:

Naturally, these numbers tie into the averages presented earlier. But this analysis also shows that the numbers can fall well below the 5,000 files per GB that Degnan suggested as the low end of the range. In particular, Excel and PowerPoint files average far lower than other file types.

Enlarging the Survey

With my curiosity piqued, I asked our team to look at additional cases. This time, we looked at 20 more cases of different sizes, again chosen pretty much at random. In total, the cases had just over 10 million files with a total of 3.8 terabytes of storage. Here is what we found:

This time, the values were quite a bit lower than in the first sample. The average number of files per GB (total files divided by total GBs) was about half of the lowest figure in the accepted range reported by both Degnan and Dutton. The median was a bit lower at 2,421.

Once again, we saw a range of values across the cases. You can see the variation here (you can also see that we presented this batch of cases in order):

Since it was relatively easy to do, I combined all of the data from the 29 cases to see what that would show me. Here is what I found:

Needless to say, these figures are quite a bit below the ones others have been throwing around. If the true average is closer to 3,300 files per GB, the industry standards will certainly need adjusting.
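As a rough check on the combined figure, the sketch below pools the two samples using only the rounded totals quoted in the text (about 8 million files in 1.6 TB and about 10 million files in 3.8 TB, with 1 TB taken as 1,000 GB). Because the inputs are rounded, the results are approximate.

```python
# Rounded totals from the text, as (files, gigabytes); approximations only.
first_sample = (8_000_000, 1_600)
second_sample = (10_000_000, 3_800)


def pooled_rate(*samples):
    """Files per GB across samples: total files over total gigabytes."""
    total_files = sum(files for files, _ in samples)
    total_gb = sum(gb for _, gb in samples)
    return total_files / total_gb


if __name__ == "__main__":
    print(round(pooled_rate(first_sample)))                 # 5000
    print(round(pooled_rate(second_sample)))                # 2632
    print(round(pooled_rate(first_sample, second_sample)))  # 3333
```

The rounded inputs give 5,000 for the first sample rather than the 4,890 reported above, since that sample held slightly fewer than 8 million files. The combined result of roughly 3,300 files per GB sits closer to the second sample because that sample holds more than twice the storage.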

Normalizing the Data

Given that we have some statistics expertise in our Search and Analytics Consulting Group, I thought I would do a little more analysis of the data we had uncovered. We started by ordering the data by size, from the lowest files per GB to the highest. Here is how it sorted out:

We then decided to remove outliers from the sample as indicated in the chart. This is a common practice as statisticians review sample results. Figures that are way low or way high might throw your analysis off and provide misleading data.

You can see the ones we removed (and agree or disagree with the practice). Doing so changed the figures to an average of 2,544 files per GB, with a median just a bit lower at 2,296.
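The post does not spell out the rule used to pick the outliers, so the sketch below substitutes one common convention, Tukey's interquartile-range fences, purely as an illustration. The input values are hypothetical and are not our case data.

```python
from statistics import mean, median, quantiles


def drop_outliers_iqr(values, k=1.5):
    """Keep values within k * IQR of the quartiles (Tukey's fences).

    This is an assumed stand-in rule, not necessarily the one used in the post.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    # Hypothetical files-per-GB figures for eight cases.
    rates = [1500, 2000, 2200, 2400, 2600, 2800, 3000, 14000]
    kept = drop_outliers_iqr(rates)
    print(kept)
    print(mean(rates), mean(kept))
    print(median(rates), median(kept))
```

With these made-up inputs only the 14,000 value is dropped. Removing it pulls the mean down sharply while barely moving the median, which is why the choice of outlier rule deserves scrutiny.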

We also calculated the standard deviation to be 1,258 documents per GB. I am no expert on this, but a standard deviation measures how widely values are spread around the mean and is closely tied to the famed bell curve (the normal distribution). You can read more about standard deviation from Wikipedia. It is often denoted by the sigma character “σ” and is at the heart of the Six Sigma programs for reducing manufacturing defects.

The goal with a standard deviation analysis is to try to get a handle on how much variance you might expect when you survey future document populations. If your document population is fairly homogenous and follows a normal distribution, you can expect that 68% of the larger document population will fall within plus or minus one standard deviation from your mean (average). Within two standard deviations, you can expect to find about 95% of your documents.

I am not in a position to say that the mean we calculated is representative or that e-discovery document populations are sufficiently homogenous to follow a normal distribution. However, we can take the data we obtained and fit it to a normal distribution curve. It looks like this:

This graph suggests that about 95 percent of the population will fall under roughly 5,000 documents per GB, with only the extreme outliers going beyond it.
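The bands implied by the fitted curve can be worked out directly from the mean of 2,544 and the standard deviation of 1,258 reported above. This sketch uses Python's standard library and assumes, as the text cautions, that a normal distribution is a reasonable fit. The function names are illustrative.

```python
from statistics import NormalDist

# Mean and standard deviation from the outlier-trimmed sample in this post.
model = NormalDist(mu=2544, sigma=1258)


def band(k: float):
    """Lower and upper bounds of mean +/- k standard deviations."""
    return (model.mean - k * model.stdev, model.mean + k * model.stdev)


def share_within(k: float) -> float:
    """Fraction of a normal population inside mean +/- k standard deviations."""
    low, high = band(k)
    return model.cdf(high) - model.cdf(low)


def share_below(files_per_gb: float) -> float:
    """Fraction of a normal population below a given files-per-GB value."""
    return model.cdf(files_per_gb)


if __name__ == "__main__":
    for k in (1, 2):
        low, high = band(k)
        print(f"+/-{k} sigma: {low:,.0f} to {high:,.0f} ({share_within(k):.1%})")
    print(f"below 5,000: {share_below(5000):.1%}")
```

Under these assumptions, one standard deviation spans 1,286 to 3,802 files per GB and two standard deviations span 28 to 5,060. The model places roughly 97 percent of cases below 5,000 files per GB.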

So How Many Docs in a Gig?

What can we make of these surveys? They certainly suggest that the average number of files per GB is well under the 10,000 figure cited as the “industry standard,” let alone the higher numbers. Our belief is that the true figure is well below even the 5,000 files per GB that Cliff Dutton reported. If so, that could impact a lot of e-discovery estimates.

I am not a statistician and the statistics professor we work with is in France on sabbatical. Nonetheless, we did look at over 18 million e-discovery files that totaled over five terabytes of storage. That is a pretty solid number from which to draw conclusions.

At the same time, let me be clear that the numbers will vary dramatically depending on the types of documents you have. With simple text files, which are rare in native e-discovery populations, the numbers could skyrocket. With postscript PDF files (converted email or short Word files for example), we saw values go well above the 10,000 mark. In contrast, with some of the other file types, PowerPoint and Excel for example, your numbers could go well below our calculated mean of 2,500 documents per GB.

Perhaps others have done more sophisticated analyses that they can share. I offer our figures just to get the discussion started. Let me know what your experience has been with your files.

I want to thank my fellow Catalyst employees, Greg Berka, Kevin Hughes and Nirupama Bhatt, for their help on this article.

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000 and is responsible for its overall direction, voice and vision.

Well before founding Catalyst, John was a pioneer in the field of legal technology. He was editor-in-chief of the multi-author, two-book series, Winning With Computers: Trial Practice in the Twenty-First Century (ABA Press 1990, 1991). Both were ABA best sellers focusing on using computers in litigation technology. At the same time, he wrote, How to Prepare for Take and Use a Deposition at Trial (James Publishing 1990), which he and his co-author continued to supplement for several years. He also wrote, Lawyer's Guide to Spreadsheets (Glasser Publishing 2000), and, Lawyer's Guide to Microsoft Excel 2007 (ABA Press 2009).

John is the former chair of the ABA's Law Practice Management Section. For many years, he was editor-in-chief of the ABA's Law Practice Management magazine, a monthly publication focusing on legal technology and law office management. More recently, he founded and edited Law Practice Today, a monthly ABA webzine that focuses on legal technology and management. Over two decades, John has written scores of articles on legal technology and spoken on legal technology to audiences on four of the five continents.

Comments

  1. Hey, John.
    You’re right, your statistics are a bit rusty, but aside from that, the general idea of taking the total number of files and dividing by the total amount of storage is probably reasonable and it supports your general conclusion. Great starting point for discussion. Thanks. Not only did you provide interesting data, but you raised a number of questions that I think are worthy of discussion to help customers understand what they have and what they’re getting for it.

    What can we make of these numbers? Part of the answer lies in exactly how you measured these things. You said something about post-culling. What did you cull out? How did you do it? Did you denist? Did you use single-instance storage? In other words, how many files you get for a GB will depend on where in the process you measure. SPAM emails might be smaller with fewer attachments than business emails. If people use an image as part of their signature, that might inflate the size of each email, unless you single-instance those images, which would remove that effect.

    I hope that this is not asking too much, but something that I would like to see is an analysis of the overall distribution of file types. How many gif files, how many pdfs, how many word files, how many emails, etc. How do you measure emails with attachments? That would further help everyone to understand what they are talking about when they do their estimates.

    Thanks for the interesting data.

    Herb

    • Thank you for your kind words. I just hadn’t seen an analysis like this. It seemed worth the effort.

      In this case, I looked at files in our repository. By definition, that meant de-NISTed, de-duplicated, many system files removed and perhaps initially culled. My reasoning was to tie into review estimates. Most reviewers are looking at the kinds of documents we were measuring.

      That said, it would be fun to do more statistical analysis on the types of questions you raised. How much gets removed by de-NISTing, de-duping or other efforts. That might or might not prove fruitful. A lot depends on what was collected in the first instance.

      Same point for spam. The goal is to get it out of the population but doubtless there is some in what we counted. And, you are right about images in email. Color TIFFs would change the equation as well. This is inherent in any estimate. The more you know about the population, the better the estimate.

      I will look into an analysis of file type distribution. We did some of that on the first set of cases. It wouldn’t be that difficult to pull out more info from more cases.

      We handle emails and attachments two ways, depending on client preference. One approach is to split out the attachments and save an HTML version of the message body. This is the smallest in storage. The other is to split the attachments but keep the MSG as the native original email. That can really increase storage values because the MSG contains the attachments. We advise clients of this in our processing documentation.

      But to our point here, how many files in a GB would be radically different if you were considering MSG files and attachments rather than HTML and attachments.

      Thank you for your thoughts.

      John

  2. Julie Brown says:

    Hi John!
    I think this is terrific information. I am currently involved in the EDRM metrics project. The metrics team is developing a web-based metrics database where service providers, law firms and corporations can input or upload metrics data. This will allow these individuals to also analyze the results to better predict volumes, time and costs. We are focused on the Processing, Analysis and Review pieces of the EDRM and are trying to keep it as simple as possible in the initial project. We are currently in the process of finalizing the metrics fields we will capture in the initial phase.

    I think file types are helpful but as you identified they can also vary dramatically (a scanned pdf vs. a converted pdf). Do you think the current industry standard is at the high end to avoid getting burnt by going well over budget?

    Finally, I’m curious if you did any analysis on .nsf’s or pst’s? We have seen some outrageous numbers here particularly on expansion rates.

    Thanks for a great write up! Very interesting!
    Julie

    • Thank you for your kind words Julie.

      The EDRM metrics project sounds exciting. Are you concerned that the data being uploaded might vary depending on how it is captured? Without some form of moderation, there is the risk of a lot of bad data getting mixed in with the good.

      Data can definitely vary. The biggest and most obvious example would be a color TIF file. These can be 10 MBs an image or greater. Size matters here as well. We received PDF files that were a GB in size and only had a couple of pages. It turned out they were graphic design files for billboards. The page size dimensions were large which translated into really large data files.

      We have not tried to analyze NSFs or PSTs but could. What we have seen is that most of our Fast Track automated uploads involve a single custodian at a time. Usually the loads are less than 4 GBs in size, often much less. But, you hear higher numbers used all the time. Seven GBs per custodian is one that I hear often.

      It is good to get data out there and to see discussions evolve.

  3. Steve Green says:

    John – thanks for doing such a comprehensive analysis and being kind enough to share it. We’re all out here trying to estimate records from data volumes and this post now sets the standard.

    I was using Ralph Losey’s “How much data do you have?” sidebar for a while, but started going lower after all the processing veterans I spoke to said those figures were laughably high. Your numbers really blow the doors off the conventional wisdom of 10,000 – 12,000 docs/GB – this post is going to be required reading for our PM team!
