Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Seminar in Houston this Week on Protecting Privilege Using Defensible Search

There is still time to register for How to Protect Privilege Using Defensible Search and Review Strategies, a live seminar and lunch in Houston on Wednesday. The seminar panel includes Harris County District Court Judge Ravi K. Sandill along with two lawyers who are experts in e-discovery search.

The program will review recent court decisions that illustrate the risks of inadvertent disclosure of privileged documents. Panelists will then review best practices for building searches, for using advanced analytics, and for verifying results through quality control and sampling.

There is no cost to attend the seminar, which is sponsored by Catalyst. Here are the details:

Date: Sept. 21, 2011

Time: Registration and lunch begin at noon. One-hour seminar begins at 12:30 p.m.

Location: Houston Technology Center, 410 Pierce St., Houston.

A registration form, a full description of the program and biographies of the speakers all can be found here.

Nuclear Secrets and Redaction: What Not to Do

It’s been at least a decade since the first stories came out about people trying to redact PDF files by drawing boxes over the text they were trying to hide. Maybe you remember some of them. The recipient unlocks the document and removes the boxes. Voilà, the hidden text reappears. Producing party (often some government agency) has egg on its face. We get another one of those stories to chuckle about.

HMS Tireless, a Royal Navy nuclear-powered submarine.

But a decade later, surely we’ve all learned our lesson, right? Everyone knows better than to redact a PDF file simply by covering the text with a black box. Don’t they?

It seems not. The Ministry of Defence (their spelling) and several other British agencies posted “redacted” PDF files on the Internet and made one of the oldest faux pas in our business. You guessed it: They covered the text on the image but did not remove the underlying text. Anyone smart enough to highlight the area or press Ctrl-A could copy the underlying text and paste it in a Word document. All the redacted text was conveniently there for the reading.

These weren’t ordinary redactions, as several British news outlets reported, including The Daily Telegraph, the Daily Star and The Register. Rather, they included secrets about Britain’s nuclear submarines, including expert opinions on how well the fleet could cope with a catastrophic accident.

The documents were published under the U.K.’s Freedom of Information Act. Officials “redacted” them by using Photoshop to paste a black patch over secret text, news reports said.

What’s going on here? Doesn’t everyone know how to redact a PDF file? Nope. They don’t. And it keeps happening.

Facebook

Facebook suffered a similar fate when it settled claims brought against it by ConnectU, the website originally called HarvardConnect that was central to the plot line of the film, The Social Network.

You probably remember. The Winklevoss brothers claimed that Mark Zuckerberg stole their idea and turned it into Facebook, making billions of dollars as a result. The parties settled the case in February 2008 for an undisclosed sum (or at least on terms they didn’t want disclosed). In June of that year, the parties participated in a court hearing, of which the transcript was later published in redacted form. The intent was to block out the parts of the record that would show how much money and stock were paid over to the claimants.

The court published the transcript as a PDF and made it available on the web. You can still see it posted at Justia.com. Page through it and you will see redacted sections. Here is page 24 for example:

You can see the redacted part is blank, which is what the court and the parties intended.

Guess what happens when I select the text tool and highlight the redacted section on the PDF? (I could also just press Ctrl-A and highlight the whole page.)

You see the highlights for the visible and the hidden text. The next step is to press Ctrl-C (or right click and choose copy). Then paste the copied text into a blank Word document (or text file). Now you see what you weren’t supposed to see. It looks something like this:

Let’s just say there was a lot of embarrassment here. And a lot of unfortunate press coverage for the parties.

What’s Going On Here?

Listen up folks. This isn’t that hard. Professionals shouldn’t be making these kinds of mistakes, let alone the Ministry of Defence—particularly involving nuclear secrets. Sheesh.

PDF files are complex documents. They support multiple layers of images and text. Just because you cover the outer layer of the document doesn’t mean you have eradicated a lower layer of text.

We have worked with PDF files since the mid-1990s, both the image plus text formats and the native postscript files. (In 1999, we were inducted into the Smithsonian Institute for our work using Adobe Acrobat in litigation.)

For most of those years, we have offered our users the ability to redact documents online from their browser. We start the process by converting the native files to the PDF format.

What we don’t do is draw a box over the offending text and call it good. Rather, we extract a page out of the PDF and “flatten” it. By that I mean we convert it from the complex, multi-layered PDF format to a simple one-layer format such as PNG or TIFF. Next we draw a box over the image containing the text (actually the user does this). Then we merge the box and the image together before converting it back to PDF form. The resulting page is merged into the document and saved as a redacted copy.

In this way, the redacted information is gone–both the image layer and the underlying text. There is no hidden layer to be discovered because it has been removed by the flattening of the file and the recreation of the page as a new image file. You can’t remove the box or scrape the underlying text from under the box. It simply doesn’t exist. Indeed, in most cases we then allow the user to OCR the file so that the remaining text is searchable.
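To make the difference concrete, here is a minimal sketch in Python. It is a toy model, not real PDF processing: the `Page` class, its `pixels` and `text_layer` fields, and the three functions are all hypothetical names invented for illustration. The point is only that covering the image leaves the text layer intact, while flattening discards it.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy stand-in for a PDF page: a visible image plus a selectable text layer."""
    pixels: list[str]  # each string is one row of the rendered image
    text_layer: list[str] = field(default_factory=list)  # text a viewer can copy


def draw_box_only(page: Page, row: int) -> Page:
    """Unsafe: black out one image row but leave the text layer untouched."""
    pixels = list(page.pixels)
    pixels[row] = "#" * len(pixels[row])
    return Page(pixels=pixels, text_layer=list(page.text_layer))


def flatten_and_redact(page: Page, row: int) -> Page:
    """Safer: keep only the image, burn the box into it, drop the text layer."""
    pixels = list(page.pixels)
    pixels[row] = "#" * len(pixels[row])
    return Page(pixels=pixels, text_layer=[])


def copy_all_text(page: Page) -> str:
    """What Ctrl-A then Ctrl-C hands back: whatever sits in the text layer."""
    return "\n".join(page.text_layer)


original = Page(
    pixels=["Settlement terms", "Amount: SECRET"],
    text_layer=["Settlement terms", "Amount: SECRET"],
)
boxed = draw_box_only(original, row=1)
flattened = flatten_and_redact(original, row=1)

print(repr(copy_all_text(boxed)))      # the "redacted" line is still there
print(repr(copy_all_text(flattened)))  # nothing left to copy
```

Both outputs look identical on screen, since the same image row is blacked out in each. Only the flattened page has nothing underneath, which is why a separate OCR pass is needed afterward to make the remaining text searchable again.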

There are plenty of other ways to accomplish this. Several years ago, Adobe added a redaction feature to its paid product (not the free Acrobat Reader). If you use the tool properly, the resulting redaction will not be recoverable and there will be no hidden text awaiting prying eyes (or smart computer geeks).

What you don’t want to do is take the approach I saw in a couple comments posted on the Web. They suggested that you place a black box over the offending text and then lock down the PDF (no changes or text scraping allowed, for example). The problem with that approach is this: There are scores of free or almost free PDF cracking tools. I haven’t tested them with the latest versions of Acrobat but last time I tried I could bust open a locked down PDF in a few seconds. Then I could go in and remove that box.

A better option would be to save the file to TIFF and then turn it back into a PDF. This is easy to do with Acrobat and will eliminate any chance of recovering the hidden text. TIFF is a simple format and won’t support a hidden layer. Do that if you don’t have a better alternative.

If your redactions involve secret ingredients for a favored family recipe, the simple place-a-box-over-the-words approach might do the job fine. But if you are a government official with nuclear secrets and a submarine fleet to protect, you should seek help from a pro before you post those documents on the web. You have to love this business!

[Image: basketman / FreeDigitalPhotos.net]

Catalyst Webinar: The Expanding Role of Search in E-Discovery

When searches fall short in e-discovery, the consequences can be serious. In recent years, courts have started closely scrutinizing lawyers’ searches–and even questioning their ability to craft thorough searches. When courts find flaws in a litigant’s search, they are increasingly likely to find waiver of attorney-client privilege, allow adverse inferences, order directed verdicts and impose sanctions that, in some cases, have run into the millions.

On May 23 at noon ET, Catalyst is hosting a free webinar, The Expanding Role of Search in E-Discovery. A panel of leading experts will review the key information you should understand about search in order to protect yourself and your clients. Topics will include:

  • The search cases: From O’Keefe to Hawley Insurance and beyond. Are lawyers qualified to craft and run their own search methodologies? If not, what should I be doing?
  • Defensibility of your search protocol. What must lawyers do to ensure that inadvertently produced privileged documents will be returned and not used in the case pursuant to FRE 502 and “clawback” agreements?
  • Tips and techniques to find privileged documents. Practical advice from search experts on how to find privileged documents and improve your search strategy.
  • Advanced analytical techniques for managing large document populations. New statistical and mathematical techniques to find relevant documents and pare down document populations to target relevant documents.

Speakers for the program will be:

  • Michael Arkfeld. Nationally renowned speaker and author of leading treatises on e-discovery, including Arkfeld on Electronic Discovery and Evidence (Lexis 3rd Ed.) and the annually updated Best Practices Guide for Electronic Discovery and Evidence. Michael is a former assistant U.S. attorney with 20 years of trial experience.
  • Charles W. Cohen. Partner at Hughes Hubbard & Reed LLP, where he is a co-chair of the eDiscovery Practice Group and a co-chair of the firm’s Technology Committee. Mr. Cohen has a national and international litigation and e-discovery practice, and has appeared in numerous federal and state courts across the country.
  • John Tredennick. Author of five books and countless articles on legal technology and electronic discovery issues who has spoken to legal audiences on five continents. After 20 years as a trial lawyer and litigation partner for an AmLaw 200 firm, John founded Catalyst Repository Systems, which provides secure, hosted document repositories for electronic discovery.

Read more about this webinar or register now.

NIST Issues Draft Guidelines on Security and Privacy in the Cloud

While everyone who uses cloud computing should be alert to security and privacy issues, lawyers and litigation support professionals have a special responsibility in that regard. Not only are they entrusted with preserving the confidentiality of client communications, but they also play key roles in ensuring that their clients comply with a myriad of laws and regulations pertaining to data. Even so, legal professionals often have far more questions than they do answers about how to evaluate the privacy and security of cloud providers.

Earlier this month, the National Institute of Standards and Technology (NIST) published a draft document, Guidelines on Security and Privacy in Public Cloud Computing (PDF), that provides an overview of the security and privacy challenges pertinent to public cloud computing and suggests factors organizations should consider when outsourcing data, applications and infrastructure to a public cloud environment.

At the same time, NIST launched a new NIST Cloud Computing Collaboration wiki to enable those involved in cloud computing to collaborate in refining the NIST’s standards.

NIST also released a draft that updates its work to create a definition of cloud computing, The NIST Definition of Cloud Computing (Draft) (PDF). NIST is seeking feedback on this draft, as well.

NIST’s Recommended Guidelines

The NIST draft guidelines pertain only to the “public cloud,” which NIST defines this way:

A public cloud is one in which the infrastructure and other computational resources that it comprises are made available to the general public over the Internet. It is owned by a cloud provider selling cloud services and, by definition, is external to an organization. At the other end of the spectrum are private clouds. A private cloud is one in which the computing environment is operated exclusively for an organization. It may be managed either by the organization or a third party, and may be hosted within the organization’s data center or outside of it.

The 60-page draft provides a fairly in-depth discussion of the key security and privacy issues and NIST’s recommendations for how to address them. In summary, NIST recommends:

  • Carefully plan the security and privacy aspects of cloud computing solutions before engaging them.
  • Understand the public cloud computing environment offered by the cloud provider and ensure that a cloud computing solution satisfies organizational security and privacy requirements.
  • Ensure that the client-side computing environment meets organizational security and privacy requirements for cloud computing.
  • Maintain accountability over the privacy and security of data and applications implemented and deployed in public cloud computing environments.

“In general,” NIST adds, “organizations should have security controls in place for cloud-based applications that are commensurate with or surpass those used if the applications were deployed in-house.”

The Security Upside

Even as it addresses security precautions related to the cloud, the NIST report also takes note of what it calls “the security upside.” For many companies, particularly smaller organizations, the cloud holds the prospect of improving their overall security.

Companies may have only a limited number of IT administrators and security personnel. Cloud providers, by contrast, offer a number of features that promote security, NIST says:

  • Staff specialization. Cloud providers have staff that specializes in security and privacy.
  • Platform strength. The structure of cloud computing platforms is typically more uniform than that of most traditional computing centers. That enables better automation of security management activities like configuration control, vulnerability testing, security audits and security patching.
  • Resource availability. Redundancy and disaster recovery capabilities are built into cloud environments. Scalable, on-demand resource capacity can be used for better resilience when facing increased service demands or distributed denial of service attacks, and for quicker recovery from serious incidents.
  • Backup and recovery. The backup and recovery procedures of a cloud provider may be superior to those of a company or firm. Data maintained within a cloud can be more available, faster to restore, and more reliable than that maintained in a traditional data center.
  • Mobile endpoints. Because the main computing resources are with the cloud provider, they can be accessed using lightweight and easy-to-maintain computers such as laptops, notebooks and netbooks, as well as embedded devices such as smart phones, tablets and PDAs.
  • Data concentration. Data maintained and processed in the cloud can present less of a risk to an organization with a mobile workforce than having that data dispersed on portable computers or removable media out in the field, where theft and loss of devices routinely occur.

NIST has put its development of final guidelines on a fast track at the request of Vivek Kundra, the U.S. government’s chief information officer. He wants to accelerate the federal government’s adoption of cloud computing and ensure that it is done securely.

NIST has set Feb. 28, 2011, as the deadline for submitting comments on these drafts.


Common E-Discovery Error #7: Searching for Non-Indexed Documents

This is the seventh in a series of posts on common e-discovery errors and how to avoid them. For background on the series, see the introduction.

No matter how diligently you search, you can’t find something if it isn’t there. Such is the case with document text. If the text is not indexed, your search will not find it. Forgetting this simple fact is an all-too-common error in e-discovery searches.

Search engines don’t actually search the text of the documents. Rather, they search an index of the text of the documents. If the document contains no readable text to be indexed, or if the document is not properly indexed, then the document will not be found.

The danger in this is that the searcher may believe the search was complete when, in fact, it was not. The search may have found all matching text, but it may have overlooked matching documents whose text was not readable.
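A small sketch may help show why this happens. The code below builds a bare-bones inverted index in Python. The function names and the sample documents are hypothetical, and real search engines are far more elaborate, but the principle is the same: a query consults the index, so a document that contributed no text to the index can never be a hit.

```python
import re
from collections import defaultdict


def build_index(docs: dict[str, str]) -> tuple[dict[str, set[str]], set[str]]:
    """Map each term to the documents containing it; track documents with no text."""
    index: dict[str, set[str]] = defaultdict(set)
    not_indexed: set[str] = set()
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        if not tokens:
            not_indexed.add(doc_id)  # nothing extractable, so nothing to index
            continue
        for token in tokens:
            index[token].add(doc_id)
    return index, not_indexed


def search(index: dict[str, set[str]], term: str) -> set[str]:
    """Look the term up in the index; the documents themselves are never read."""
    return set(index.get(term.lower(), set()))


# Hypothetical collection: scan.pdf is an image-only scan of a privileged letter,
# so the extracted text handed to the indexer is empty.
docs = {
    "memo.docx": "Privileged advice from counsel",
    "scan.pdf": "",
    "email.msg": "Lunch on Friday",
}
index, not_indexed = build_index(docs)

print(search(index, "privileged"))  # only memo.docx is found
print(not_indexed)                  # scan.pdf was silently skipped by the search
```

The search result is complete with respect to the index and still incomplete with respect to the collection. The `not_indexed` set is the part the searcher has to go looking for.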

Why Would Documents not be Indexed?

Even though a document or file contains text that you and I can see, the text may not be viewable by the software that creates the search index. The simplest example is a photograph of a sign. If we look at the photograph, our eyes can read the text on the sign. But indexing software, without an OCR component, is unable to distinguish that text from any other part of the photo.

There are several reasons why a document’s text may not be indexed:

  • No text due to file type. Typically, text can be extracted only from standard applications. The document may have been created in an application from which text cannot easily be extracted, or it may be a photo or other file that has no text.
  • No OCR (or bad OCR). A file that is scanned into an image-only TIFF or PDF format requires OCR software to create the indexable text. If the file is not run through OCR, or if the OCR program or process is deficient, the text will not exist in a form that can be indexed.
  • Hardware or software error. As with any computer application, sometimes there can be “hiccups” in the indexing process, requiring that the documents be re-indexed. To avoid this situation, QC following indexing is essential.
  • Human error. A technician can make mistakes that prevent proper indexing, such as incorrect mapping to the text in the load file or incorrect copying of the data so that the mapping is incorrect. Again, QC should uncover these errors.
  • Password protection. Indexing will skip password-protected documents. You need to identify them and either get the password or break the password (if possible).
  • Office 2007/2010. When Microsoft first released Office 2007 and its new document format, many systems were unable to read it. By now, most systems have caught up and have the ability to extract text from Office 2007/2010 files. However, a document collection may include Office 2007 documents that contain non-indexable text (or text that hasn’t been indexed).
  • Non-Western language issues. As mentioned in a prior blog post (Common E-Discovery Error #6: No Comprende), languages that are not standard Western languages–such as Japanese, Chinese, Korean and Arabic–can raise indexing issues. In the UTF-8 form of Unicode supported by most e-discovery search engines, accented Latin characters as well as Greek, Cyrillic, Arabic and Hebrew characters occupy 2 bytes each. All other characters, including Chinese, Japanese, Korean and Indian, occupy 3 bytes each. Supplementary characters occupy 4 bytes each. Each language must use the proper encoding to index correctly. If a document in Japanese is saved as ASCII or incorrectly converted, it won’t index correctly.
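The byte counts in the last item are easy to check for yourself. This short Python sketch encodes one sample character from each group and reports its UTF-8 length. The sample characters are our own picks for illustration. It also shows what happens to Japanese text forced into ASCII.

```python
# One sample character per script, written as escapes so the file encoding
# of this snippet cannot affect the result.
samples = {
    "ASCII letter A": "\u0041",
    "accented Latin e-acute": "\u00e9",
    "Greek omega": "\u03a9",
    "Cyrillic zhe": "\u0416",
    "Arabic ain": "\u0639",
    "Hebrew shin": "\u05e9",
    "Japanese/Chinese ideograph": "\u65e5",
    "Korean Hangul syllable": "\ud55c",
    "Devanagari letter a": "\u0905",
    "supplementary (musical G clef)": "\U0001d11e",
}


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")

# Forcing Japanese text into ASCII destroys it: every character becomes "?".
japanese = "\u65e5\u672c\u8a9e"
damaged = japanese.encode("ascii", errors="replace")
print(damaged)
```

Plain unaccented ASCII characters take a single byte, which is why English-only collections rarely surface these problems. Once text has been damaged the way `damaged` is here, no index can recover the original words.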

There is one other tricky aspect to this that can trip you up if you’re not careful. Sometimes a document that has no readable text in its body will not be reported as a non-indexed document because there was text in the document properties that was indexed. For example, if there is a scanned PDF that was not OCR’d, but that has any text in the document properties, it will not be reported as a “non-indexed” document, even though the text of the PDF is not indexed.

The Danger of Non-Indexed Documents

There is an old proverb, “What you don’t know can’t hurt you.” Unfortunately, that is not the case in e-discovery.

One danger when the search index omits document text is that counsel will fail to produce a responsive document. If this happens by a good faith mistake, it is unlikely to result in sanctions. For one, the other side is unlikely to know about the omission. For another, the attorneys for the producing party may not even know about it.

The greater danger in this scenario is that counsel will inadvertently produce a privileged document–and that IS a big problem. That was part of what led to the disastrous outcome in Mt. Hawley Insurance Company v. Felman Production Inc., 2010 WL 1990555 (S.D.W.Va., May 18, 2010), a case in which counsel inadvertently produced a “smoking gun” document. (Even if the court enforced the clawback agreement there – see Bad Facts Make Bad Law: ‘Mt. Hawley’ A Step Backward for Rule 502(b) – turning over the smoking gun documents that didn’t hit on the privilege search would still have been a disaster. This bell could not be unrung.)

Consider what happens in a responsiveness search when text is not included in the search index:

  • The review only includes search “hits,” so the non-hits aren’t reviewed and aren’t tagged as responsive.
  • If it is a “produce everything that isn’t privileged” review, only the documents that hit on the search terms are produced. There may be responsive documents that are not produced.

Now compare the possible outcomes in a privilege search:

  • In cases where every responsive document is being reviewed for privilege with help from searches to find “potentially privileged” documents, more privileged documents are likely to slip through undetected if they are not tagged as “potentially privileged.”
  • Worse, in cases where all non-privileged documents are being produced, non-indexed privileged documents, if any, would be produced because they didn’t hit on the privilege searches.

Best Practices for Avoiding Indexing Disasters

Non-indexed documents pose dangers, but they are dangers you can minimize or avoid altogether. Here are best practices we at Catalyst’s Search & Analytics Consulting Group recommend:

  • After data is loaded, get exception reports that show non-indexed data by file type, and look carefully at the counts.
  • Sample the non-indexed data by file type. Does any of the data need OCR?
  • When you run searches, be sure to sample the non-hits as well as the hits.
  • Remember to do sweep searches later to look for any documents that become searchable but that were not searchable when searches were previously run.
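As a rough illustration of the first three practices, here is a minimal Python sketch. The record layout (`id`, `file_type`, `body_text`) and both function names are assumptions made up for this example and are not the output of any particular review platform. Note that the report checks the body text specifically, so a scanned PDF that has text only in its document properties still shows up as an exception.

```python
import random
from collections import Counter


def exception_report(docs: list[dict]) -> Counter:
    """Count documents with no indexed body text, grouped by file type."""
    return Counter(d["file_type"] for d in docs if not d["body_text"].strip())


def sample_non_hits(docs: list[dict], hit_ids: set[str],
                    sample_size: int, seed: int = 0) -> list[str]:
    """Draw a random sample of document IDs that did not hit on the search."""
    non_hits = [d["id"] for d in docs if d["id"] not in hit_ids]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(non_hits, min(sample_size, len(non_hits)))


# Hypothetical load: two PDFs and one TIFF came through with no body text.
docs = [
    {"id": "D1", "file_type": "docx", "body_text": "Draft agreement"},
    {"id": "D2", "file_type": "pdf", "body_text": ""},
    {"id": "D3", "file_type": "tiff", "body_text": ""},
    {"id": "D4", "file_type": "pdf", "body_text": "Invoice 42"},
    {"id": "D5", "file_type": "pdf", "body_text": "   "},
]

report = exception_report(docs)
print(report.most_common())  # counts of non-indexed documents per file type

sample = sample_non_hits(docs, hit_ids={"D1"}, sample_size=2)
print(sample)                # two non-hit documents to eyeball
```

Recording the seed along with the sample makes it possible to show later exactly which non-hits were checked.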

The rules don’t require perfection – just reasonableness. Take reasonable steps to avoid these non-indexed document issues.

Text that is visible to the human eye is sometimes invisible to the computer. Forget that and you may end up with productions that are under-inclusive or over-inclusive. It is a common mistake, but one you can easily avoid.

Common E-Discovery Error #6: No Comprende

This is the sixth in a series of posts on common e-discovery errors and how to avoid them. For background on the series, see the introduction.

Businesses operate in a global economy that spans borders, cultures and languages. One increasingly common result of that is that e-discovery collections contain documents in multiple languages. Search and review of multiple-language collections is fraught with potential pitfalls.

The Catalyst resource library includes several case studies and white papers addressing the challenges of multi-language search and review. The challenges for search can be complex, particularly for the so-called CJK languages—symbol-based languages such as Chinese, Japanese and Korean.

That said, there are a handful of mistakes we commonly see made with regard to multi-language document collections. We’ll briefly review them here.

1. Ignoring non-English documents.

You’ve no doubt heard that ancient Japanese maxim, “Hear no evil, see no evil, speak no evil.” But you’d be shocked by how commonly that maxim gets applied in the context of e-discovery.


This is not an advisable approach to e-discovery.

Wittingly or not, lawyers sometimes simply ignore non-English documents. They run English-language search terms against a multi-language database and accept the results as final. Or if questions do arise during the search or review, they are never resolved before production.

Bottom line: English-language search terms will not find foreign-language results.

2. Inaccurate search translation.

Your friend’s daughter studied Japanese in college and offers to translate your search terms. You can’t beat the price. But does she know that Japanese uses four different alphabets? If so, does she know all four? Is she familiar with Japanese usage of business and legal terms? What we call a “contract” or “agreement,” the Japanese often refer to as a “promise.”

The Slippery Issue of Relevance in Document Review

By Rupa Bhatt*

Electronic discovery is a multi-stage process of collecting, processing, analyzing, searching and reviewing electronically stored information. Yet all these stages are directed towards two basic tasks: identifying relevant information and preventing the disclosure of privileged information.

For cases involving significant collections of ESI, a key stage in the accomplishment of these two tasks is review. Even as search and analytics enable more sophisticated identification of documents through computer technology, there is still the need in virtually every case for eyes-on review by trained reviewers.

But if, as the saying goes, “Beauty is in the eyes of the beholder,” the same can be said for relevance. It is a slippery concept and its application is somewhat subjective. As reviewers move from document to document, they make judgment calls about each one’s relevance. Their judgment calls are not always perfect or consistent.

The fact of the matter is, when it comes to e-discovery document review, a certain number of inconsistent review calls are not only to be expected, but are unavoidable. Given this, how can you be better prepared in order to prevent these incidents from impacting your outcomes?

Later in this post, I will offer some suggestions on best practices to follow in order to minimize the impact of inconsistent relevance determinations. But first, some background.

Guideposts for Determining Relevance

The federal rules provide the two primary guideposts that we follow in defining relevance in discovery and at trial:

FRCP Rule 26: Duty to Disclose; General Provisions Governing Discovery. (b) Discovery Scope and Limits. (1) Scope in General. Unless otherwise limited by court order, the scope of discovery is as follows: Parties may obtain discovery regarding any non privileged matter that is relevant to any party’s claim or defense. … Relevant information need not be admissible at the trial if the discovery appears reasonably calculated to lead to the discovery of admissible evidence.

FRE Rule 401: Definition of “Relevant Evidence.” “Relevant evidence” means evidence having any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.

Given the broadness of these definitions, the determination of relevance can be a highly subjective judgment call. At the least, it is dependent on the case and on the counsel or reviewer. And the impact of that judgment call can be wide and varied. It can affect every subsequent stage of the e-discovery and document review process.

It is often the fact that the reviewer’s subjective determination impacts the relevance determination for other similar documents. Given this, one can always question the relevance assessment criteria – human and subjective, electronic and objective — applied at every stage of document review. These judgment calls in coding documents are often referred to as “coding inconsistencies” or “bad coding calls.” Such inconsistencies can have serious consequences, ranging from production of irrelevant or privileged documents to the imposition of judicial sanctions such as an adverse inference.

How Reviewers Determine Relevance

Relevance does not behave. People behave. Although various factors can contribute to deviations in relevance determinations, the single-most important factor is the subjectivity of the reviewer.

A paper produced by the TREC 2008 Legal Track described it this way:

While the ultimate determination of responsiveness (and whether or not to produce a given document) is a binary decision, the breadth or narrowness with which “responsiveness” is defined is often dependent on numerous subjective determinations involving, among other things, the nature of the risk posed by production, the party requesting the information, the willingness of the producing party to face a challenge for underproduction, and the level of knowledge that the producing party has about the matter at a particular point in time. Lawyers can and do draw these lines differently for different types of opponents, on different matters, and at different times on the same matter. This makes it exceedingly difficult to establish a “gold standard” against which to measure relevance/responsiveness and explains why document review cannot be completely automated.

Hence, the interplay of all these factors defines and directs the outcome of the review itself.  Some examples of “influencers” that may impact the relevance determinations include:

Human/subjective factors. These factors can include the prior experience and knowledge of the reviewers as a whole (e.g., Is this an off-shore review team unfamiliar with U.S. litigation? Is the review team trained on the review platform?), as well as individual reviewers’ comprehension, perceptions, ideas, concepts, emotional state, cultural norms, etc.

Instead of “subjective,” it may be more appropriate to say that discovery involves judgment about the situation as well as about the documents and their contents. Some judgments bias the reviewer to be more inclusive and some bias the reviewer to be less inclusive, but these judgments are not made willy-nilly. As opposed to pure errors, which are random, these judgment calls are based on a systematic interpretation of the evidence and the situation.

Objective factors. These factors can include relevance criteria as presented by lead counsel (i.e., how complex is the review in terms of numbers of issues and parties? How many reviewers are there? What is the timeline? Is the review happening in different geographical locations and time-zones? What is the review workflow? Is it well-formulated?).

Technology factors. These can include the review platform (i.e., How easy is it for counsel and reviewers to navigate through documents? How fast is the tool? Are there any technological limitations hindering the review?).

[Figure 1: relevance triangle]

Ultimately, the determination of document relevance involves the interplay among these three factors and can be represented in many ways, as illustrated by Figure 1 and Figure 2.

[Figure 2: relevance strata]

Apart from the above factors, the reviewer is also figuring out a pattern while trawling through the documents. This happens intuitively and at a subconscious level, based on the matter, the training and the information provided before the start of review.

Research indicates that reviewers look for clues and triggers in the document. While there is no wide consensus, on a general level, we can group these triggers into the following categories:

  • Overt and latent semantic content: Topic, quality, depth, scope, treatment, clarity.
  • Object: Characteristics of the document, e.g. type, organization, representation, format, availability, accessibility, information flow, threads.
  • Validity: Accuracy of information provided, authority, trustworthiness of sources, verifiability with coding docket.
  • Situational match: Appropriateness to situation or tasks, urgency.
  • Belief match: Credence given to information, acceptance as to truth, reality, confidence.

Reviewers’ Learning Curve as a Factor

As a reviewer progresses through documents, the reviewer’s knowledge of the matter and comfort with the review platform improve. As they do, the accuracy of document relevance determinations also improves.

We typically see more coding inconsistencies within the first five days of a review project. During this time, the reviewers are learning the matter as well as getting a feel for the underlying documents. A reviewer’s understanding and knowledge of the review platform can either benefit or hinder the ability to accurately assess document relevance.

By way of illustration, many of you might have answered market research surveys. Observe that a question might have been asked three different times using different terms and in different sections of the questionnaire. Ever wondered why? It is because the researcher is checking the reliability and consistency of the individual’s response over time. Invariably, there is an inconsistency observed in the individual’s response.

Another example is the study conducted by Ellen M. Voorhees (2000) with TREC data. One set of reviewers was asked to review a random sample of documents and assess them for relevance pursuant to pre-defined criteria. A second team then assessed the same set of documents (already coded for relevance). On analyzing the results, it was determined that of the documents considered relevant by the first set of reviewers, only 80% were considered relevant by the second set of reviewers. So what changed? The documents and the pre-defined criteria were the same, but the reviewers were different.
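To make the arithmetic behind that kind of comparison concrete, here is a minimal sketch of how the overlap between two review teams could be measured. The document IDs and coding calls are invented for illustration and are not data from the study; the function name is also hypothetical.

```python
def agreement_on_relevant(first_calls, second_calls):
    """Share of documents the first team coded relevant that the
    second team also coded relevant."""
    relevant_first = [doc for doc, rel in first_calls.items() if rel]
    if not relevant_first:
        raise ValueError("first team marked no documents relevant")
    also_relevant = sum(1 for doc in relevant_first if second_calls.get(doc, False))
    return also_relevant / len(relevant_first)


# Hypothetical coding calls for six documents (True = relevant).
team_a = {"D1": True, "D2": True, "D3": True, "D4": True, "D5": True, "D6": False}
team_b = {"D1": True, "D2": True, "D3": True, "D4": True, "D5": False, "D6": False}

# Team B agreed on 4 of the 5 documents Team A coded relevant.
print(agreement_on_relevant(team_a, team_b))  # 0.8
```

Note that this measure is one-directional: it starts from the first team's relevant set, so swapping the teams can give a different number.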

Best Practices to Prevent Inconsistent Relevance Assessments

By now you get the picture. Relevance is a slippery and subjective concept. As I said at the outset, a certain number of inconsistent review calls are not only to be expected, but are unavoidable. That does not mean, however, that you should throw your hands up in defeat. There are various best practices you should employ in review in order to minimize the number and impact of relevance miscalls.

  • Plan, plan, plan. It is imperative to walk through a review workflow and to integrate counsel into each stage of the quality control and assurance process.
  • Provide repeatable, detailed case training and easily accessible case documents. If you use an outside review team, consider recording their training session. The recording will always help the team revisit the issues that may come up repeatedly and can be used as a training tool for new reviewers.
  • Choose a good team. In most cases, the team of reviewers can either win or lose the case. Hence, choose wisely.
  • Implement a thorough review workflow. The workflow should include escalation points. Supplement it by creating a detailed review journal outlining your expectations for reviewers. Clearly identify the issues and the criteria for relevance. Make certain to find example documents to use for training. Address the logistics of the review, including how quickly reviewers should move through documents and how they should handle parent–child relationships, email threads and near duplicates, redactions and annotations etc. Our suggestion is to create “cheat sheets” for easy reviewer reference throughout the review project.
  • Identify a senior attorney as a dedicated subject-matter expert. During the initial days of a review, it is essential to follow strict quality control procedures to identify inconsistencies in coding. Involving counsel in the initial days (to answer questions, offer insight, clarify what may seem like minor points, participate in “hands on” quality control, provide ad-hoc, on-demand training, etc.) can offer immense benefits in ensuring accurate and consistent document coding.
  • Build in tools for quality control. Your review process should include QC checks for coding calls regarding issues, privilege, responsiveness etc. The process should provide timely reports of review quality, accuracy, speed, code reversals etc. The system should provide alerts for inconsistent coding across families. If a parent document is tagged responsive, the tool should be able to tag all the family members as responsive (or something similar). It is a good practice to use automated rules and intelligent auto-coding methodologies, as well as bulk updates. To ensure uniformity and prioritize review, consider leveraging software analytics, such as classifiers, clustering and e-mail threading.
  • Second-level QC plan. This should be developed by senior attorneys, specifically to address the case at hand.
  • Review audits. Conduct these periodically and document findings, successes and failures.
  • Employ sampling. Sample documents extensively for relevance and privilege.
  • Post-review audit. You should conduct an extensive post-review audit and QC check.
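As one illustration of the family-consistency alert mentioned in the list above, here is a minimal sketch of a check you could run against an export of coding calls. The field names (`doc_id`, `family_id`, `responsive`) are assumptions for the example and will differ from platform to platform.

```python
from collections import defaultdict


def inconsistent_families(docs):
    """Return the family IDs whose members carry more than one
    responsiveness code."""
    codes_by_family = defaultdict(set)
    for doc in docs:
        codes_by_family[doc["family_id"]].add(doc["responsive"])
    return sorted(fam for fam, codes in codes_by_family.items() if len(codes) > 1)


# Hypothetical export: F1 is a parent email and an attachment coded differently.
docs = [
    {"doc_id": "A1", "family_id": "F1", "responsive": True},
    {"doc_id": "A2", "family_id": "F1", "responsive": False},
    {"doc_id": "B1", "family_id": "F2", "responsive": True},
    {"doc_id": "B2", "family_id": "F2", "responsive": True},
]

print(inconsistent_families(docs))  # ['F1']
```

Whether a flagged family is then auto-coded to match the parent or sent back for a second look is a workflow decision; the sketch only surfaces the conflict.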

These best practices won’t make relevance any less slippery a concept to nail down. But they can help prevent the review process from slipping out of your control.

*Rupa Bhatt is employed by Catalyst as a member of the Catalyst Consulting team.

Setting Up Review Workflows for Multi-Language Documents

The world is getting smaller. For large corporations, it is virtually certain that their operations span multiple countries. But it is no longer just large corporations that operate globally. These days, even small- and mid-sized businesses are likely to have international components.

When a business is international, then any legal matters involving that business are also likely to be international in scope. In the context of litigation or a government investigation, that means the matter is likely to involve documents in more than one language. Often, such cases will involve collections of documents in a number of different languages – or even single documents containing multiple languages.

In other words, multi-language documents are a fact of e-discovery life these days. For e-discovery professionals, processing and review of multi-language collections raise a number of issues. In this post, I want to talk about one – review workflow.

Language Identification

Any successful multi-language review begins with computerized language identification. While most platforms support language identification, they vary greatly in efficiency. Language identification uses built-in dictionaries to identify the primary (and sometimes secondary) language present in a document. This information can be used to route documents to the appropriate reviewer or to distinguish which documents need to be sent out for translation before they can be reviewed.

In the case of Chinese/Japanese/Korean (CJK) documents, language identification is less precise than when dealing with a Western-language character set. Frequently, document headers, email formatting and email signatures contain CJK text while the substantial portion of the record is in a Western language. This minimal amount of CJK text can cause the entire document to be coded as CJK. To navigate this problem, use a search tool that can both tokenize and count Western versus CJK words within a given document. These numbers can help establish a baseline to determine the true language of a document.
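The counting idea can be sketched in a few lines. This is only a rough approximation under stated assumptions: it counts individual CJK characters rather than true CJK words (real word segmentation needs a dedicated tokenizer), the Unicode ranges are a simplified subset, and the 0.5 threshold is an arbitrary placeholder rather than a recommended baseline.

```python
import re

# Simplified character classes: Hiragana/Katakana, common Han, Hangul syllables.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")
# Runs of Latin letters, including common accented forms.
WESTERN_WORD = re.compile(r"[A-Za-z\u00c0-\u024f]+")


def language_counts(text):
    """Return (CJK character count, Western word count) for a document."""
    return len(CJK_CHAR.findall(text)), len(WESTERN_WORD.findall(text))


def primary_script(text, cjk_threshold=0.5):
    """Label a document by whichever script dominates the counts."""
    cjk, western = language_counts(text)
    total = cjk + western
    if total == 0:
        return "unknown"
    return "CJK" if cjk / total >= cjk_threshold else "Western"


# Hypothetical email: English body with a Japanese signature block.
email = (
    "Please review the attached draft contract before Friday and send comments.\n"
    "田中太郎\n"
    "東京本社"
)
print(language_counts(email))
print(primary_script(email))
```

In practice you would tune the threshold by sampling documents on either side of it, since a short English body under a long CJK signature can still tip the ratio.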

Language Specific Batching

With the limited number of foreign language reviewers and the often high cost associated with obtaining their services, it is important to have a clear system for assigning foreign language documents.

While most foreign language reviewers can review both their native language and English, you don’t want them wasting their time reviewing a document that 90% of your other reviewers can read. English documents, documents with an unknown language and documents with no text should go to your English reviewers. Only documents containing a non-English language should go to your foreign language reviewers.

Keep in mind though, when reviewing by document families, if any document contains a foreign language, the entire document set should go to your foreign language reviewer.
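The batching rules above, including the family rule, can be expressed as a small routing function. This is a sketch only: the language codes, the use of `None` for documents with no text, and the queue labels are all assumptions for the example.

```python
# Languages that go to the English review pool: English, unknown, or no text.
ENGLISH_POOL_LANGUAGES = {"en", "unknown", None}


def route_family(family):
    """Decide which review queue a whole document family goes to.

    `family` is a list of dicts with 'doc_id' and 'language'
    (None when the document has no text).
    """
    foreign = sorted(
        {d["language"] for d in family if d["language"] not in ENGLISH_POOL_LANGUAGES}
    )
    if not foreign:
        return "english"
    # Any foreign-language member sends the entire family to foreign review.
    return "foreign:" + ",".join(foreign)


# Hypothetical family: English cover email with a Japanese attachment.
family = [
    {"doc_id": "E1", "language": "en"},
    {"doc_id": "E1-att", "language": "ja"},
]
print(route_family(family))  # foreign:ja
```

A family that contains more than one foreign language comes back with every language listed, so you can decide whether it needs a multilingual reviewer or translation.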

Flexible Workflows

No language identification is perfect. Inevitably reviewers are going to come across documents they can’t read. That’s why it is important to have a flexible workflow when setting up your review.

Take advantage of a rule-based review platform to re-route documents to the appropriate reviewer. If an English reviewer comes across a Russian document, the platform should provide the ability to reassign that document so that it does not get lost in the shuffle. If no reviewer has that language proficiency, incorporate a system in which a coder can tag a document for translation.

N.C. Ethics Opinion on SaaS Merits Broader Inquiry

The Ethics Committee of the North Carolina State Bar issued a proposed ethics opinion recently that could break significant ground. As we noted in an earlier post, the committee was asked whether a law firm could use a SaaS (Software as a Service) provider to store confidential client data or documents. The specific question was this:

SaaS for law firms may involve the storage of a law firm’s data, including client files, billing information, and work product, on remote servers rather than on the law firm’s own computer and, therefore, outside the direct control of the firm’s lawyers.

Given the duty to safeguard confidential client information, including protecting that information from unauthorized disclosure; the duty to protect client property from destruction, degradation, or loss (whether from system failure, natural disaster, or dissolution of a vendor’s business); and the continuing need to retrieve client data in a form that is usable outside of the vendor’s product; may a law firm use SaaS?

Yes, You Can Use SaaS Providers

Not surprisingly, the committee answered with a resounding “Yes,” so long as the law firm takes steps to minimize the risk of inadvertent disclosure of client confidential information.

That makes sense to me simply as a matter of practicality. The market is quickly moving toward the SaaS delivery model because it is cheaper and provides better functionality and features so long as you are connected to the Internet. If the trend continues, there may not be any client-based software in a few years. It may all be delivered as a service over the Internet by SaaS providers.

Best Practices for Dealing with SaaS Vendors?

I found the next part of the opinion even more interesting. The committee went further to offer what it called “best practices” for selecting SaaS vendors.

The specific question was this:

Are there any “best practices” that a law firm should follow when contracting with a SaaS vendor to minimize the risk?

Again, the answer was “Yes.” The committee suggested that a lawyer be able to answer the following questions satisfactorily in order to conclude that the risk of inadvertent disclosure is minimized.

  • What is the history of the SaaS vendor? Where does it derive funding? How stable is it financially?
  • Has the lawyer read the user or license agreement terms, including the security policy, and does he/she understand the meaning of the terms?
  • Does the SaaS vendor’s Terms of Service or Service Level Agreement address confidentiality? If not, would the vendor be willing to sign a confidentiality agreement in keeping with the lawyer’s professional responsibilities? Would the vendor be willing to include a provision in that agreement stating that the employees at the vendor’s data center are agents of the law firm and have a fiduciary responsibility to protect client information?
  • How does the SaaS vendor, or any third party data hosting company, safeguard the physical and electronic security and confidentiality of stored data? Has there been an evaluation of the vendor’s security measures including the following: firewalls, encryption techniques, socket security features, and intrusion-detection systems?
  • Has the lawyer requested copies of the SaaS vendor’s security audits?
  • Where is data hosted? Is it in a country with less rigorous protections against unlawful search and seizure?
  • Who has access to the data besides the lawyer?
  • Who owns the data—the lawyer or SaaS vendor?
  • If the lawyer terminates use of the SaaS product, or the service otherwise has a break in continuity, how does the lawyer retrieve the data and what happens to the data hosted by the service provider?
  • If the SaaS vendor goes out of business, will the lawyer have access to the data and the software or source code?
  • Can the lawyer get data “off” the servers for the lawyer’s own offline use/backup? If the lawyer decides to cancel the subscription to SaaS, will the lawyer get the data? Is data supplied in a non-proprietary format that is compatible with other software?
  • How often is the user’s data backed up? Does the vendor back up data in multiple data centers in different geographic locations to safeguard against natural disaster?
  • If clients have access to shared documents, are they aware of the confidentiality risks of showing the information to others?
  • Does the law firm have a back-up for shared document software in case something goes wrong, such as an outside server going down?

These are pretty hefty requirements. I am not sure most lawyers will be able to call Google or Microsoft and get answers to these questions.

Moreover, some are open to debate. For example, is the committee requiring that a lawyer only use a SaaS vendor with data centers in different geographical locations? If so, that will add to the service costs. I don’t know of many law firms that save their data to multiple data centers to protect against a natural disaster. In my experience, most keep their backups in the same vicinity as their primary files. Some keep the backup tapes in the same office.

The basis for the committee’s opinion is pretty interesting. The committee cited email recommendations from Erik Mazzone, the director of the Center for Practice Management at the North Carolina Bar Association. It also referred to the ABA Legal Technology Resource Center.

I don’t challenge Mr. Mazzone’s recommendations so much as suggest that these kinds of issues merit broader inquiry. The opinion is one of the first on the subject, which means it will be persuasive to the next bar dealing with this issue. There is certainly the chance that the recommendations will be picked up as precedent and codified as the standards for dealing with SaaS vendors. I hope there is more discussion on some of these points before the cement hardens.

To be fair, the committee issued this as a tentative opinion in an attempt to generate comments. Moreover, they expressly stated that the list was not meant to be all-inclusive and suggested “consultation with a security professional competent in the area of online computer security.” They also noted that “given the rapidity with which computer technology changes, what may constitute reasonable care may change over time and a law firm would be wise periodically to consult with such a professional.”

All in all, I commend the committee for a thoughtful opinion that heads in the right direction. I hope others pick up this debate and add their ideas. SaaS is the future, both for business and the legal profession. Lawyers will use SaaS providers and clients will benefit through better and cheaper services. Let’s hope the other bar associations agree.

Proposed 2010 Formal Ethics Opinion 7, Subscribing to Software as a Service While Fulfilling the Duties of Confidentiality and Preservation of Client Property (April 15, 2010).

Best Practices in Name Searching for Attorney-Client Privilege

Note: This is the first in a series of posts about protecting documents that are subject to attorney-client privilege. Future posts will cover such issues as creating privilege logs, avoiding doc-by-doc privilege review using categorization approaches, searching for legal terminology, and privilege searching as part of quality control. We will also comment on interesting legal opinions that interpret FRE Rule 502 and related issues.

When I first began doing e-discovery consulting, one of my first projects was to take a list of 4,000 attorney names provided by a corporate client to set them up in Power Search, our batch searching program. I spent quite a bit of time testing various alternative searches, and developed rules as to the best approaches to find attorney names in order to tag them as potentially privileged.

Following Victor Stanley v. Creative Pipe, 250 F.R.D. 251, 2008 WL 2221841 (D. Md. 2008), and the adoption of Federal Rule of Evidence 502, an attorney must have a documented and formalized way of searching for potentially privileged documents.

We now manage a list of 2,000 firms and almost 6,000 attorney names for document upload for every case we handle for a large corporate client.

Here is how we do it:

NAMES

We recommend searching both the document text and the metadata using the following approach:

  • If the person does not have a unique first or last name, search for (firstname or nicknames or FirstInitial) NEAR/2 LastName. Including the first initial finds more documents, but also more false positives, so whether to use it is a judgment call.
  • If the person has a unique last name, then just search for the last name. Qualifying it with first name will just miss documents in which the person was referred to by last name alone.
  • Similarly, if the person has a unique first name, just search for the unique first name, and also search for the first initial near last name, like this: Firstname OR (FirstInitial near/2 Lastname). If both names are unique, search for each separately.
  • Search for all forms in which the names may appear, including typical misspellings. For example, if the person’s name has a space or punctuation in it (or if your search engine treats punctuation as a space), you would search for (O’Hara or Ohara).
  • Remember that an accented letter is a different Unicode character from the same letter without the accent, so if the name has an accent, search for it both ways. For example, you may need to search for (Göthe or Goethe) or for (Jose or José).
  • If an attorney (usually a woman) changes her name with a change of marital status and uses both her maiden name and married name, such as Susan Jones Smith, be sure to search for her first name(s) Near/2 (Jones or Smith). If you just search for “Jones Smith,” you will miss her earlier communications.
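The rules above lend themselves to a small query builder. The sketch below is illustrative only: the function names are hypothetical, the NEAR/2 and OR operators follow the notation used in this post rather than any particular search engine's syntax, and deciding whether a name is "unique" is left to the person preparing the list.

```python
import unicodedata


def name_query(first, last, nicknames=(), unique_first=False,
               unique_last=False, use_initial=True):
    """Build a search string for one attorney following the rules above."""
    if unique_first and unique_last:
        return f"{first} OR {last}"          # search each name on its own
    if unique_last:
        return last                           # last name alone is enough
    if unique_first:
        return f"{first} OR ({first[0]} NEAR/2 {last})"
    variants = [first, *nicknames]
    if use_initial:                           # more hits, more false positives
        variants.append(first[0])
    return f"({' OR '.join(variants)}) NEAR/2 {last}"


def accent_variants(name):
    """Return the name as given plus a copy with accent marks stripped."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return sorted({name, stripped})


print(name_query("Michael", "Smith", nicknames=["Mike"]))
# (Michael OR Mike OR M) NEAR/2 Smith
print(accent_variants("José"))
```

Stripping accents turns "Göthe" into "Gothe", not "Goethe", so transliterated spellings like that still need to be added to the list by hand.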

Frequently firms will send us a list of names that includes nicknames instead of formal first names, such as “Skip Jones.” If the client sends us “Mike Smith,” we will search for Mike or Michael. But we have no way of knowing how to search for the formal name of “Skip,” so take the extra time to find out the formal names of all of the attorneys on the list.

EMAIL ADDRESSES

Make the extra effort to get the email addresses of the attorneys and legal staff. Frequently we receive a list of names without email addresses. Searching for the law firm domain names will catch most of the law firm attorneys, but not necessarily in-house counsel or attorneys and staff who use other domains, such as BlackBerry or Gmail addresses.

If we do not have the email addresses, we will often search for first initial plus last name, such as “RJones” for Robert Jones, or whatever convention the law firm or company uses for email addresses.
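A minimal sketch of generating those guesses follows. Only the first-initial-plus-last-name pattern comes from the approach described above; the other two patterns are assumed alternatives added for illustration, and the function name is hypothetical. Confirm the actual convention with the firm or company whenever you can.

```python
def email_local_part_guesses(first, last):
    """Return candidate email local parts (the text before the @) for a name."""
    first, last = first.lower(), last.lower()
    return [
        f"{first[0]}{last}",   # rjones: the convention described above
        f"{first}.{last}",     # robert.jones: assumed alternative
        f"{first}{last[0]}",   # robertj: assumed alternative
    ]


print(email_local_part_guesses("Robert", "Jones"))
# ['rjones', 'robert.jones', 'robertj']
```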

FIRM NAMES

Searching law firm names and domains is generally straightforward.

  • If the firm name or domain has changed, be sure to search for all versions.
  • Search for the smallest number of significant words that will find the firm. That is, when searching for Skadden, Arps, Slate, Meagher & Flom, LLP, you’ll find more if you leave off the “LLP,” more still if you just search for “Skadden Arps”, and more still if you just search for “Skadden.” We always leave off the LLC or PC, and usually just search for the first two names of multiple named firms.
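Trimming a firm name down to its significant words is easy to automate across a long list. The sketch below assumes a short list of words to ignore; LLP, LLC and PC come from the discussion above, while dropping "and" and "the" is an added assumption. The function name is hypothetical.

```python
import re

# Entity suffixes and filler words that add nothing to the search.
IGNORED_WORDS = {"llp", "llc", "pc", "and", "the"}


def firm_search_term(firm_name, max_names=2):
    """Reduce a full firm name to its first few significant words."""
    # Drop periods first so "P.C." is read as "PC".
    words = re.findall(r"[A-Za-z']+", firm_name.replace(".", ""))
    significant = [w for w in words if w.lower() not in IGNORED_WORDS]
    return " ".join(significant[:max_names])


print(firm_search_term("Skadden, Arps, Slate, Meagher & Flom, LLP"))  # Skadden Arps
print(firm_search_term("Skadden, Arps, Slate, Meagher & Flom, LLP", max_names=1))  # Skadden
```

Shorter terms find more documents but also more noise, so a one-word term works best when the first name in the firm is distinctive.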

POPULATING THE REVIEW FORM

To aid the reviewers, we often create a field called “PrivilegeTerms” on the review form and populate it with the terms that hit on the privilege search. That makes it easy for the reviewers to see why the document was tagged as potentially privileged.
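Here is a rough sketch of how such a field could be filled in. It uses simple whole-word, case-insensitive matching on plain terms, which is an assumption for illustration; a real privilege search would run the proximity queries through the review platform's own search engine.

```python
import re


def privilege_terms(text, terms):
    """Return the search terms found in the text, joined for display
    in a 'PrivilegeTerms' field."""
    hits = [
        t for t in terms
        if re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)
    ]
    return "; ".join(hits)


# Hypothetical document text and term list.
text = "Per Susan Jones at Skadden, please hold this draft for now."
print(privilege_terms(text, ["Jones", "Skadden", "O'Hara"]))  # Jones; Skadden
```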

SAMPLING

After you have tagged the documents as potentially privileged, you should generate a hit report by name and look for terms that seem unreasonably high, and sample those for false positives. In one case, for example, the name of an attorney in Cleveland was the same as the name of an important engineer in a patent case, and had so many hits that it jumped out. Sampling those documents easily found the false hits so that we could figure out the best way to search for just the attorney.
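A hit report of this kind can be sketched as a simple count of documents per term, with a crude flag for terms that stand out. The rule used here (more than five times the median count) is an arbitrary assumption for the example, not a tested threshold, and the names and counts are invented.

```python
from collections import Counter
from statistics import median


def hit_report(doc_terms):
    """Count how many documents each privilege term hit on.

    `doc_terms` maps a document ID to the list of terms that hit in it.
    """
    counts = Counter()
    for terms in doc_terms.values():
        counts.update(set(terms))  # count each term once per document
    return counts


def suspicious_terms(counts, factor=5):
    """Flag terms whose document count is far above the typical term."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(t for t, n in counts.items() if n > factor * typical)


# Hypothetical counts: "Baker" is both an attorney and a frequently named engineer.
counts = Counter({"Baker": 400, "Jones": 12, "Ohara": 8, "Skadden": 20})
print(suspicious_terms(counts))  # ['Baker']
```

The flagged terms are only candidates for sampling; a high count may be perfectly legitimate for lead outside counsel.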

Also, per Judge Grimm in Victor Stanley, you should sample the documents that did NOT hit on the searches if the review is not a document-by-document review in which the attorneys will be reviewing each document.
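Drawing that sample is straightforward once you have the list of documents that hit. The sketch below pulls a simple random sample from the non-hit set; the function name, the document IDs and the sample size are placeholders, and choosing a statistically defensible sample size is a separate question this example does not address.

```python
import random


def sample_non_hits(all_doc_ids, hit_doc_ids, sample_size, seed=None):
    """Draw a simple random sample from documents that did not hit
    on any privilege search."""
    non_hits = sorted(set(all_doc_ids) - set(hit_doc_ids))
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(non_hits, min(sample_size, len(non_hits)))


# Hypothetical collection of 100 documents, 20 of which hit on the searches.
all_docs = [f"DOC{i:03d}" for i in range(100)]
hits = all_docs[:20]
sample = sample_non_hits(all_docs, hits, sample_size=5, seed=42)
print(sample)
```

Recording the seed alongside the sample lets you reproduce exactly which documents were checked if the process is later questioned.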

CONCLUSION

Over time, through trial and error, we have found these to be the best approaches to find attorney names in potentially privileged documents. We welcome your feedback and any suggestions you have for alternative approaches.