Catalyst Repository Systems - Powering Complex Legal Matters

E-Discovery Search Blog

Technology, Techniques and Best Practices

Article: Meeting the Challenges of Asian Language E-Discovery

As e-discovery reaches across borders into Asia, global companies face new and often unfamiliar challenges. Whatever the nature of the case, if it involves electronic information stored in China, Japan, Korea or elsewhere in Asia, be advised: you’ll be managing case files differently than you would in the United States.

The challenges presented in managing electronic files in Asia stem from many causes—some geographical, some technical and some cultural.

In Asian countries, the laws governing data and privacy are quite different from those in the U.S. For example, in China, collecting and exporting data involving “state secrets” can get you thrown in jail. In Japan, taking data out of the country and hosting it in the U.S. may cause you to lose your client.

Language, too, presents multiple challenges. The so-called CJK languages (Chinese, Japanese and Korean) are the most difficult to process, search and review. Mangle the processing and you lose your data. Mess up the search and you may as well have lost your data. Either way, your review becomes costly and ineffective.

In an article published in the February/March 2013 issue of Today’s General Counsel magazine, “Challenges of Asian Language E-Discovery,” John Tredennick, President and CEO of Catalyst, and W. Peter Cladouhos, Esq., firm-wide Practice Support Electronic Discovery Consultant for Paul Hastings LLP, outline some of the most common, and most critical, challenges companies face in handling Asian data and keeping Asian e-discovery on track and on budget.

Webinar Offers Guidance on Mitigating Risk in Cross-Border E-Discovery

Join two leading authorities in international e-discovery for a free, one-hour webinar, Cross-Border E-Discovery: Meeting the Challenges and Mitigating the Risks, to be held on Wednesday, Sept. 21, 2011, at noon Eastern time.

The webinar will explore the challenges for multinational corporations engaged in cross-border e-discovery, from data privacy laws and discovery-blocking statutes to language and cultural issues, and offer tips for mitigating risk.

Panelists for the webinar will be:

  • Maura R. Grossman, counsel at Wachtell, Lipton, Rosen & Katz. Ms. Grossman’s practice focuses on advising lawyers and clients on legal, technical, and strategic issues involving electronic discovery and information management, both domestically and abroad, as well as on matters of legal ethics. Ms. Grossman speaks and writes frequently on e-discovery and legal ethics and is a member of several Sedona Conference working groups.
  • Richard Kershaw, Asia managing director for Catalyst Repository Systems. A fluent Japanese speaker, Mr. Kershaw has lived and worked in the Asia region since 1996. Over the years, he has successfully led forensic data management assignments in arbitration, litigation and regulatory investigations across the region, including matters in Saudi Arabia, India, Indonesia, Singapore, Malaysia, Hong Kong, China, Taiwan, the Philippines and Japan.

For more details about this webinar or to register, visit: Cross-Border E-Discovery: Meeting the Challenges and Mitigating the Risks.

With FCPA Actions on the Rise, Search Takes Center Stage

Corporate Counsel magazine recently issued a report that should cause multi-national corporations and their counsel to pay attention: Trend Watch: Foreign Bribery Actions Doubled Last Year.

Specifically, the magazine reported that enforcement actions under the Foreign Corrupt Practices Act (“FCPA”) nearly doubled in 2010, rising to 76 (with complaints against 23 companies and 53 individuals). In 2009, the SEC and Justice Department brought 45 actions (against 12 corporations and 33 individuals). That 2009 number was itself a significant jump from 2008, when the government brought 37 actions against companies and individuals.

The pace seems to be continuing as well. This month, Paul Hastings, one of the leading international firms advising on FCPA investigations, issued its first Quarterly FCPA Report for 2011 [PDF]. So far this year, it reports, enforcement continues apace, with actions brought against four companies and seven individuals, along with a blockbuster forfeiture and a number of guilty pleas and settlements. The forfeiture amounted to nearly $149 million and related to a high-profile arms contract case involving 22 indicted defendants.

Another international law firm, Herbert Smith, in an article, Developments in Anti-Bribery Legislation: The UK Bribery Act and its Impact for Japanese Companies [PDF], reported that, of the 10 largest FCPA settlements of all time, eight were reached in 2010 and eight (together totaling over US$2.25 billion) were settlements with non-U.S. companies.

A lot of the recent activity seems to relate to the changing of the guard after the 2008 election. Under the Bush administration, FCPA enforcement happened but was not a priority. Under Obama and the Democrats, FCPA investigations seem to be a priority. As Assistant Attorney General Lanny A. Breuer said in a speech at a recent national FCPA conference, “We are in a new era of FCPA enforcement [and] we are here to stay.” (Also see our earlier post, DOJ’s Breuer Vows Heightened FCPA Enforcement.)

Add to all that the recent enactment of the Dodd-Frank Wall Street Reform and Consumer Protection Act, which provides that “whistleblowers” who provide information to U.S. authorities leading to successful prosecutions under the FCPA may be entitled personally to huge sums as a result (up to 30% of the monetary recovery). (See Fried Frank’s client memorandum, New Incentives for Foreign Corrupt Practices Act Whistleblowers: Dodd-Frank Wall Street Reform and Consumer Protection Act [PDF].)

At the least, the government had over 140 prosecutions and investigations underway in 2010, according to EthicalCorp.com. That figure is dramatically higher than in previous years under prior administrations.

All you can say is watch out.

What Does This Have to Do with Search?

A lot actually. FCPA investigations typically involve hundreds of thousands or even millions of documents collected from all over the world. Sometimes, the investigations are initiated after the government issues a complaint. In those cases, counsel have a starting place for their investigation. The government has a good-faith obligation to set forth the basis of its complaint and that should alert counsel as to the people to interview and the subject matter for their searches.

In other cases, it doesn’t work that way. The FCPA imposes successor liability on an acquiring corporation when a merger occurs. That means the buyer of another company can be held liable for bribes and other corrupt activities that took place even before the merger, even if the buyer itself never did anything wrong.

In that regard, it is a bit like buying a U.S. company that has plants and property. If you later determine that the property you are buying is contaminated with toxic chemicals, you may be facing expensive Superfund liability. It doesn’t matter that you didn’t release any pollutants at your company or at the new site you acquired. You’re stuck nonetheless.

Superfund’s broad environmental liability has led to a growing and lucrative practice for environmental audit companies. The same is true for the FCPA. Some of the largest law firms in the world, with the depth and geographic coverage to mount these investigations, offer specialized FCPA practices. Paul Weiss is one such firm but there are a number of others in the game.

The key difference is this: If you are doing an environmental audit, you know exactly where the property is. You can send out your teams to inspect the ground, review the chemical history of the plant, check where materials were dumped and even drill for problems. It is just a matter of money but you can certainly find problems if you are engaged to take a look.

FCPA Investigations are Not Easy

What about in an FCPA investigation? Well, the problem is a bit different. First, what kind of fraud are we looking for? Counsel can’t exactly assemble the staff and ask for a show of hands from anyone who has bribed a foreign official lately. What, nobody raised their hand? Well, bribery isn’t something you usually put on your resumé or Facebook page.

What to do? The law firms we work with often start by talking to people and collecting their documents. What can you learn about how they deal with government officials? What do the documents show? What about the expense accounts and other money transfers?

Search is a key part of the answer. Modern search engines allow you to search millions of pages in a variety of languages with the click of a mouse. But what do you search for? That’s the hard part.

Traditional Search Doesn’t Cut It

Traditional Boolean search certainly has a part to play in an FCPA case but it isn’t always the most effective method. The reason is that we don’t know exactly what we are looking for, let alone the terms that might elicit those documents. Searching for “bribe*” within 10 words of “government official” probably won’t do the trick. Searching for the names of the government officials in question (assuming you know even that) might help.

Boolean search becomes even tougher in FCPA due diligence investigations because we are trying to prove a negative—that employees in the company to be acquired were not bribing public officials. That means counsel has to comb company and employee records to determine that nothing improper is going on. A tough assignment to say the least.

This is where a non-traditional form of search can be helpful.

Think about the traditional approach to search: you ask the documents specific questions and hope they answer with helpful information. This is a bit like the games of Go Fish or Battleship from our childhoods. We kind of know what we are looking for, and the trick is to frame queries that tell us whether it is there. So we think of key term variants and try to frame our searches to find the good stuff.

“Give me all your schemes to convince government officials to give us the business.” Answer: “Go Fish.” Uggh, try again.

Let the Documents Speak to You

Now consider another approach to search, one that is more effective for these kinds of cases. Instead of questioning the documents, you let the documents speak to you and tell you their secrets. While the technique is still based on search, the approach is different. It can be far more effective when you are dealing with large volumes of documents and have no clear road map to follow.

“What does he mean?” you ask. “After all, documents don’t speak, they just sit there.”

I mean this: modern search engines collect data about documents that can help shed light on what they contain and how they relate to one another. For example, in Catalyst CR we collect statistics about the metadata contained in the documents we index. Thus, if I were looking at files obtained from a particular office or custodian, I could quickly learn a lot about their contents without running a search. Our Correlation Navigators feature allows you to see a wide range of information in a view that might look like this:

A Correlation Navigators view showing recipient facets.

In this case our focus is on recipients after a search on the Enron documents. We can quickly see who is on the sending or receiving end of emails and better focus our review on those files.
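
To make the idea concrete, here is a rough sketch in Python. It is a toy illustration of metadata faceting, not how Catalyst CR actually works, and the email addresses are made up:

    from collections import Counter

    # Toy email metadata; in a real matter these values come from the indexed documents.
    emails = [
        {"from": "alice@enron-example.com", "to": ["bob@enron-example.com"]},
        {"from": "bob@enron-example.com", "to": ["alice@enron-example.com", "carol@enron-example.com"]},
        {"from": "carol@enron-example.com", "to": ["bob@enron-example.com"]},
    ]

    # Count how often each address appears as a recipient.
    recipient_counts = Counter(addr for msg in emails for addr in msg["to"])

    for addr, count in recipient_counts.most_common():
        print(addr, count)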

Likewise, the system looks for information within the bodies of the documents and can show key concepts being discussed in the population. Here would be a view of some of the topics being discussed in a sample of the Enron population:

A word cloud of topics discussed in a sample of the Enron population.

In this case, the words are placed in alphabetical order but sized by relevance. They can provide important clues as you try to narrow the focus of your investigation.

With Catalyst Insight, the next release of our flagship product, we are taking investigation to a whole new level. Along with the field information we can provide about people and topics, we will give investigators new tools that allow the documents to speak to them.

Here is a timeline view, for example:

A timeline view in Catalyst Insight.

With the timeline and the various field facets, a searcher can let the documents “speak” about their contents, working through each facet or drilling down into the timeline to see how the communications flowed between the parties.

Investigators can use similar tools to track communications between parties, which can help guide their investigation. Here is an example from Insight:

A communication network graph in Catalyst Insight.

In this case, you can click on any individual to move that person to the center of the graph. Instantly, you can see who individuals are communicating with and how often. Click on the numbers and you can look at the actual communications. As you continue your investigation, you can go back and forth among individuals and documents.

There are a number of other techniques you can employ to speak with your documents. Clustering documents around themes can help you sort through large volumes of documents and focus on those that might matter. Finding “More Like These” from a key document set can take advantage of more complex queries than most researchers can hope to create on their own. These methods let the computer do that work based on complex algorithms and let the documents speak to you and help your investigation.

There are many more techniques we could discuss for FCPA investigations but this is a start. Suffice it to say that search is at the heart of these investigations and that search is typically more complex than we learned using Lexis or Westlaw. Mathematics, analytics and visual cues are important here as we both interrogate the documents and let them speak directly to us.

Regulating Bribes and Corruption is Catching on Globally

If you think this issue is of importance only to U.S.-based corporations, think again. It is true that, for many years, the United States stood alone in its efforts to police international corruption. The FCPA itself has been around for 30 or so years, although enforcement efforts have stepped up only recently. Some questioned our cheek in trying to tell people how to do business in other parts of the world.

Despite these misgivings, the idea of combating corruption is catching on globally. The United Kingdom recently enacted a similar law, the UK Bribery Act. It goes into effect July 1, 2011, and will affect U.S. as well as Asian companies. Recently, the U.K. Ministry of Justice published its guidance on the procedures companies can put into place to protect themselves under the new act. (Also see our earlier post, Is Your Company Ready for the UK Bribery Act?)

Other European governments are following suit. In December 2010, Spain passed legislation allowing companies to be held accountable for criminal liability and making it a crime to bribe foreign officials. A DLA Piper publication provides background on this legislation.

The Asian region isn’t ignoring this issue either. Singapore has the Corrupt Practices Investigation Bureau (CPIB) which is dedicated to enforcing its Prevention of Corruption Act.

Darren Cerasi, director of I-Analysis in Singapore (one of Catalyst’s Asia Partners), reports that while the CPIB was set up primarily to prosecute government officials, its jurisdiction also extends to civilian bribery and includes the potential for both fines and jail time. Indonesia is another country that is focused on bribery and other anti-competitive acts.

Interestingly enough, China has an Anti-Bribery Law as well, which came into effect in December 2008. The law was amended in February of this year to make it an offense to bribe government officials outside of China, as well as non-government officials. The amendment is expected to come into force in May.

Our China hand, Richard Kershaw, director of Catalyst Asia, says that China seems to be interested in bringing more accountability to government and its people. Its Anti-Monopoly Law, which took effect in August 2008, allows citizens to sue over monopolistic practices in addition to government enforcement. These laws apply to both foreign and Chinese domestic companies.


Japan is in the game as well. In a recent case, prosecutors in Japan got four former senior executives of a Japanese company to plead guilty to bribing a Vietnamese transport official. The guilty pleas are especially noteworthy given Japan’s historical reputation as a jurisdiction where anti-bribery enforcement has been relatively lax. (See the Baker Botts FCPA Update.)

Without question, international counsel will be faced with more and more of these tricky, high-stakes investigations. Documents will take center stage, and search will be the key to making sense of them.

Character Encoding: An Introduction for E-Discovery Professionals (Part 2)

By John Tredennick and Larry Barela

In part one of this post, we reviewed the history of character encoding, from the development of ASCII in the early 1960s to the eventual creation of an array of different code sets to accommodate an array of different languages. The effect of all these different code sets was to create a technological Babel that made it difficult to share and process data across borders.

For a time, all this special encoding worked passably well. In the early days, people were less concerned with passing documents from country to country. E-mail wasn’t anywhere near the universal communications medium it is today. Google hadn’t been invented and facebooks were still something students passed around college dorms.

By the early- to mid-1990s, however, people started feeling the pinch of all this encoding. A group of visionaries realized that the world needed some kind of universal encoding that could go beyond ASCII and embrace all possible languages. That realization was the impetus for the consortium that developed the Unicode Standard, the modern foundation for handling foreign language documents around the world.

Unicode was a big leap forward. Simply put, the drafters wanted to create a single character set that could be used to express every writing system out there. Having a sense of humor, they even made it big enough to encompass Esperanto (the failed universal language) and a Galactic language such as Klingon.

A lot of people think Unicode is just another way to say double byte. That is not really true. Unicode assigns a numeric value, called a code point and conventionally written in hexadecimal, to each individual letter or character. The letter A is expressed as U+0041. The word “Hello” is expressed this way:

U+0048 U+0065 U+006C U+006C U+006F

Since there are many more than 65,536 characters to express, Unicode extends well beyond two bytes. In fact, the consortium has assigned more than 100,000 characters so far, and encodings can use up to four bytes per character when needed. The full code space runs from U+0000 to U+10FFFF, room for more than a million code points, which covers English, Chinese, Arabic and a whole lot more.
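
If you want to see the mapping for yourself, a couple of lines of Python will do it, using the built-in ord function to pull out each character’s code point:

    # Print the Unicode code point for each character, in the familiar U+XXXX notation.
    for ch in "Hello":
        print(ch, "U+{:04X}".format(ord(ch)))

    # The same call works for characters far outside ASCII.
    for ch in "é中":
        print(ch, "U+{:04X}".format(ord(ch)))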

The Unicode chart for CJK symbols and punctuation


A key thing to know about Unicode is this: the consortium kept the first 256 code points identical to ASCII and its most common Western extension, Latin-1. By doing so, they made it easy for the Western world to adopt the new standard, because existing Western text mapped cleanly onto the new code points. At the same time, they forced the rest of the world to change encodings if they wanted to get on the Unicode bandwagon.

The core point about Unicode is that it can handle such a wide range of characters because it can use more than one byte per character. In the years since its introduction, Unicode has become a global standard and is the native encoding of modern operating systems and platforms such as Windows, Mac OS X, the .NET framework, Java and XML, among many others.

UTF-8: The Leading Brand of Unicode

The original Unicode encodings were built around two bytes of data, which, as you will recall, supported 65,536 or so possible characters. This two-byte form came to be known as UTF-16 (Unicode Transformation Format), named for the 16 bits encompassed in two bytes of data.

This caused consternation among English-language programmers. Why? Because they faced the possibility of having to use two bytes to store text that needed only one. Why waste the extra bytes? Better to stick to ASCII programming, many thought, particularly when they weren’t programming for the international set.

The sentiment was strong enough that the consortium adopted a variable-width encoding called UTF-8. It uses a single byte for each of the 128 ASCII characters. When more is needed, leading bits in the first byte signal how many bytes the computer should read for the next character: two bytes cover roughly the next 2,000 characters (most accented Latin, Greek, Cyrillic, Hebrew and Arabic letters), and three bytes cover the rest of the basic range, including the Chinese, Japanese and Korean characters. Four-byte sequences handle the rarer supplementary characters beyond that.
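
A quick Python sketch shows the variable widths in action; the characters here are just examples, and UTF-16 is included for comparison:

    # Byte counts for a few characters in UTF-8 versus UTF-16.
    for ch in ["A", "é", "中"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")  # big-endian, without a byte-order mark
        print(ch, "UTF-8:", len(utf8), "byte(s)", "UTF-16:", len(utf16), "byte(s)")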

Today, UTF-8 is the worldwide standard. Because of its versatility and consistency with ASCII, UTF-8 is steadily becoming the preferred encoding for e-mail, web pages, and other places where multi-language characters are stored or streamed.

Dealing with Non-Unicode

Even so, Unicode has yet to be adopted by everyone. Many Asian programs (including e-mail systems as well as HTML and text pages) are still expressed using legacy, language-specific code pages. Japanese e-mail systems, for example, often use an encoding such as Shift-JIS for message text. Others still run older versions of Exchange in a non-Unicode configuration, which can fool people who expect Exchange collections to be in UTF-8. Collection and processing of this data can easily be mangled if your computer is not set to recognize the encoding being used.

To make matters worse, while Exchange and its PST format are Unicode compliant, for some unknown reason the popular MSG format itself is not. The body of the e-mail messages you extract will typically be in Unicode format (although the administrator can choose otherwise). However, certain metadata fields are not kept in Unicode. Specifically, the Subject, From, To, CC and BCC fields are not kept in Unicode when they are expressed in the MSG format (they are kept in Unicode when stored in Exchange itself).

This is important for e-discovery professionals who have to collect, process, index and make this data viewable. If you extract MSG files from Exchange without setting your computer to handle the local encoding properly, you will likely mangle your data. Viewers will see the body of the e-mail properly but may see those bothersome question marks, boxes and other characters that result from improper code mapping. By then, it is too late to fix the problem; you have to recollect or at the least reprocess the data.
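
Here is a minimal sketch of that kind of mangling, using a made-up Japanese subject line. Decoded with the wrong code page, the Shift-JIS bytes come out as gibberish or replacement characters; the correct code page recovers the text:

    # A Japanese word ("mitsumorisho," a written estimate) of the sort found in a subject line.
    original = "見積書"

    # Suppose the field was stored using the legacy Shift-JIS code page.
    raw_bytes = original.encode("shift_jis")

    # Decoding with the wrong code page produces unreadable gibberish...
    print(raw_bytes.decode("latin-1"))
    # ...and decoding as UTF-8 with replacement yields the familiar question marks and boxes.
    print(raw_bytes.decode("utf-8", errors="replace"))
    # Decoding with the correct code page recovers the original text.
    print(raw_bytes.decode("shift_jis"))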

At Catalyst, we built an automated processing system that allows our clients and partners to submit raw files (including PSTs, NSFs and loose e-mail) directly to our system for processing. Where Asian data is collected, we encourage our users to specify the locale settings used to collect the data (which hopefully match the settings on the computer being harvested). By doing so, we can automatically route the collected data to servers set to the same locale. That way, non-Unicode data won’t be mangled during processing and can be sent on for indexing and review.

Beyond processing, it is equally important that your search and review platform be able to recognize and deal with non-Unicode files as well as the typical ASCII and Unicode data. Catalyst’s system is based on the FAST platform, which includes technology to recognize the encodings used in non-Unicode documents and convert them to Unicode for indexing and search. The process isn’t always perfect because there is no foolproof way to recognize which encoding has been used. The hope is that the programmer specified which encoding is being used in the document, but that doesn’t always happen. Sometimes, the computer simply has to guess.

Thus, when you are dealing with Asian and other languages, you may find you have a problem with non-Unicode documents even if your system is “Unicode compliant.”

Language Detection and Encoding are Not the Same

Legal professionals sometimes confuse language detection with encoding. They are not the same. The difference is important.

Let’s start with language detection. A Unicode document, as you now know, may include text from many languages, all in a single encoding. After all, that was the purpose of Unicode: to provide a single code page that could express every language.

So, when the computer is ingesting a Unicode document, it has to find a way to determine whether the document is in Spanish, English, French or some other language. You might think that Unicode would solve that problem, given that it assigns characters in each language to specific code points. However, the problem is that the different languages use common letters and often common words, sometimes with different meanings. How do we tell which language is being used?

For example, in English, the word “chat” means to talk informally. In French, chat means cat. If we see just “chat,” how do we know which language it is? Unicode doesn’t help us here because both languages use the common characters. The same is true with Asian as well as other languages. Japanese uses many characters taken from the written Chinese but sometimes gives them different meanings. And Chinese has two written forms: traditional and simplified. Many of the characters are the same but the meanings and usage can be different.

Our system uses a special language detection program from a company called Basis Technology to try to recognize the different languages used in the documents as they are processed and indexed. How it does this is a closely guarded secret. We know that the program analyzes the letters and other symbols being used, considers how they are used and whether any characters are unique to a specific language, and even looks up words in a variety of dictionaries to determine the most likely language.

It helps to have more text when you are trying to determine languages used in a document. If you only have a few Chinese or Japanese symbols, the system might be confused, particularly if the symbols are common to both languages. With a paragraph of text, the system is less likely to be confused.
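
As a rough illustration of the concept, here is a sketch using the open-source langdetect package (pip install langdetect) rather than the Basis Technology engine our system actually uses; the sample sentences are made up:

    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # detection is probabilistic; fixing the seed keeps results repeatable

    samples = {
        "english":  "Please review the attached contract and send your comments.",
        "french":   "Le chat dort sur la chaise près de la fenêtre.",
        "japanese": "見積書を添付しますのでご確認ください。",
        "chinese":  "请查收附件中的合同文本并回复意见。",
    }

    for label, text in samples.items():
        print(label, "->", detect(text))  # returns a language code such as "en" or "ja"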

We ask our users to specify the language in which they are searching when the CJK languages (Chinese, Japanese or Korean) are involved. Unless you know the intended language, you can’t properly tokenize the characters so that they match the pages being searched. (Tokenization, which essentially means breaking up text properly into word units, is a topic for another day.)
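
Tokenization deserves its own post, but a tiny sketch hints at why the language matters. This one uses the open-source jieba package (pip install jieba) to segment a Chinese sentence; it is an illustration only, not the tokenizer our system uses:

    import jieba

    # Chinese text has no spaces between words, so search terms must be segmented the
    # same way the indexed documents were.
    text = "我们需要审查所有电子邮件"  # roughly, "we need to review all the email"
    print(jieba.lcut(text))           # a list of word tokens rather than raw characters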

What about non-Unicode encoding? If we can detect the encoding used in a document, will that tell us the language being used? Unfortunately, the answer is no. Knowing a document’s encoding can help narrow down the possibilities, but it won’t always tell you the language.

The reason for this is that a single encoding can encompass a number of languages. For example, say we have a non-Unicode document with a code page used for the Arabic script, the writing system for Arabic, Farsi, Urdu and many other languages. Assume we detect the encoding correctly; what does that tell us about the language being used? Considering the Arabic script alone, the document could be in any of a long list of languages. Here are a few possibilities: Persian, Urdu, Pashto, Baloch and Malay; Fulfulde-Pular, Hausa and Mandinka (in West Africa); Swahili (in East Africa); Brahui, Kashmiri, Sindhi, Balti and Panjabi (in Pakistan); Arwi (in Sri Lanka and southern India); Chinese and Uyghur (in China and Central Asia); Kazakh, Uzbek and Kyrgyz (in Central Asia); Azerbaijani (in Iran); Kurdish (in Iraq and Iran); Belarusian (among Belarusian Tatars); Ottoman Turkish; Bosnian (in Bosnia); and Mozarabic.

Which language is being used? That is the province of language detection software. Encoding only helps get you started.

Why it Matters to Understand Encoding

In this flat-Earth era of globalization, multi-language documents are becoming a standard fixture of the e-discovery landscape. If you represent a multinational, you will undoubtedly be required to collect and review documents from a number of countries to determine relevance and privilege. If you sue a multinational, you will most likely receive non-English documents that you will have to master. Many documents will contain several languages. An e-mail, for example, might easily include combinations of Japanese, Chinese and English.

If you haven’t encountered this yet, you will—and soon. The more you know about the process and the pitfalls, the better prepared you will be.

Character Encoding: An Introduction for E-Discovery Professionals (Part 1)

By John Tredennick and Larry Barela

“There is something wrong with your system,” the angry lawyer on the phone said to Laura, the project consultant working on her case. “I am looking at the screen and all I see are a bunch of question marks and boxes,” she continued, getting more exasperated by the minute. “How am I supposed to review these documents if I can’t read the words?”

“Let me see if I can help,” Laura answered, trying to be as calm as possible. “Perhaps your computer is just using the wrong code page to display the text. If so, we can probably fix the problem with a mouse click,” she offered hopefully. “If not, there could be a problem with how your data was collected or processed.”

“A code page?” responded the caller. “What the heck is a code page?”

Our caller’s confusion was not unusual. After all, most of us went to school to study law rather than technology. Many lawyers still have little interest in knowing more about technology than how to turn on their computers.

But legal-technology professionals do need to know about code pages and character encoding, particularly as multi-language discovery becomes more common. The good news is that the subject isn’t that difficult. It is just a matter of taking it step-by-step.

In this two-part article, we provide a primer on what you need to know. In this first part, we review the development of ASCII, the basic standard for English-language programs, and discuss its limitations for non-English languages. In part two, we introduce you to Unicode, the standard that evolved to address ASCII’s limitations.

ASCII: The Base for English-Language Programs

To get a handle on character encoding and code pages, we need to start at the beginning, with ASCII, which, for our purposes, was the first character encoding set created for computing.

ASCII, pronounced “ask-ee,” is the acronym for American Standard Code for Information Interchange. First developed in 1963 and finalized as a standard in 1968, ASCII was a system for encoding the basic characters computers use to communicate with people.

The task was to create a universal way to represent all of the basic characters one needed to use a computer—from writing programming code to drafting a research memo using a word processing program. And, because computers run off binary code (bits and bytes), ASCII needed to be expressed in bits and bytes as well.

In ASCII, each character you see on this page (and others you can’t see) is presented to the computer not as letters but as a “byte” of code. A byte consists of eight individual “bits” that are either a “1” or a “0.”

Thus, the letter “A” would be encoded in seven bits as: 100 0001.

The letter “B” is: 100 0010.

The letter “C” is: 100 0011.

And so on through the alphabet (capital and lowercase letters). ASCII also includes the 10 digits (0-9) along with standard punctuation characters ($ % * & + =, etc.). It reserves the first 32 positions for control characters that handle things like tabs, line feeds and carriage returns.

You probably noticed that the letters shown above use only seven bits rather than the eight that make up a byte of code. The drafters of the standard felt that 128 characters would be plenty to represent the letters, numbers and other “control characters” they would need, yet most computers required eight bits as a minimum unit. So they typically used the last bit for parity checking, figuring they would never need it for anything else.
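
A couple of lines of Python make the mapping concrete, printing the seven-bit pattern for the letters shown above:

    # Print the 7-bit ASCII pattern for a few characters, grouped to match the examples above.
    for ch in "ABC":
        bits = format(ord(ch), "07b")         # 'A' -> '1000001'
        print(ch, bits[:3] + " " + bits[3:])  # printed as '100 0001'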

This ASCII system worked great in a world that spoke English and has held up well for more than four decades. Over the years, it became the base for text transcripts (“Could I get an ASCII copy of the transcript please?”), the core of most word processing programs, and the heart of most of the programming code used in litigation support applications.

There was only one problem with seven-bit ASCII. Having 128 possible combinations of 1s and 0s works fine if your alphabet only has 26 letters. But what if you want to compute in French or German or Russian or Hebrew? Even more fun, what if you are one of the billions of Chinese or Japanese or other Asian-language speakers blessed with a language that has tens of thousands of characters?

Extended ASCII

Not willing to change their mother tongues just to use computers, non-English-speaking programmers began developing their own character sets, extending the ASCII standard. Simply moving from seven to eight bits doubled the range of characters to 256, which helped for many languages. Most of these sets kept the first 128 ASCII characters and added the characters they needed in the extended range. This extra set of character points was often called “upper” or “extended” ASCII.

At the same time, English-speaking programmers started using the additional characters to support line drawings and horizontal and vertical bars so you could make spiffy drawings on your page. The IBM PC, for example, offered the OEM character set, which included some of the accented characters needed for European languages along with a bunch of line drawing characters that programmers used to make boxes and other rough graphics on those primitive DOS screens.

As you can imagine, this got a bit crazy. People from different countries started creating proprietary code sets to handle each of their languages. For the most part, they reserved the first 128 characters for the original English characters, which meant that English worked everywhere. After that, it was, “Katy bar the door!” There were thousands of code sets to choose from.

That meant your computer program had to know which encoding (code page) was being used. Why? Because different encodings used the same numerical value to express different characters. Character 162 in upper ASCII, for example, displays as a different character depending on which code page is in use.
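
A short Python sketch makes the point; the code pages here are just a few common examples chosen for illustration:

    # The same byte value, 162, decodes to a different character under different legacy code pages.
    byte = bytes([162])
    for codepage in ("cp437", "cp1252", "cp866", "iso8859_5"):
        # Depending on the code page, this prints an accented vowel, a cent sign or a Cyrillic letter.
        print(codepage, "->", byte.decode(codepage))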

What happened if your computer didn’t know which code page was being used? Simply put, you would get gibberish. The computer would render either nonsensical characters from a different language or those famed question marks and boxes that bedeviled our user in this story. Why question marks and boxes? Because the characters being used did not exist in the code set the computer was using to render them. Lacking a valid character to display, the computer displays question marks or, in some cases, the funny boxes.

For these special encodings to work well across borders, computer operating systems had to be able to recognize and handle each one and have the proper fonts loaded to display them. In addition, programmers had to properly label their code sets, which did not always happen. That left your computer confused and left you with those strange characters on your screen.

Double Byte Languages

To make matters even more interesting, some languages have so many characters that they couldn’t begin to fit within the narrow confines of upper ASCII. After all, the addition of the eighth bit to the ASCII standard provided only an additional 128 characters. That may work fine for French or Spanish, but the Chinese written language has something like 65,000 symbols, and Japanese and Korean also have far more than 128 symbols of their own.

Lacking any other alternative, Asian programmers started using a second byte to express those languages. Adding another byte to describe the written symbols or characters you need changes the picture dramatically. Instead of 256 possibilities, you now have 65,536 options to play with. The number is simply a matter of two to the power of 16. Most of us remember how quickly the numbers add up when you start with two (representing a 0 or 1, which are the two possible states of an individual bit). Two to the eighth power is 256. Two to the 16th power is 65,536.

This resulted in the special “double byte” encodings you hear about if you work with the Asian languages. “JIS” and “Shift JIS” are two such encodings from Japan. For Chinese, Big 5 is used for Traditional Chinese and the GB encodings (such as GB2312) for Simplified Chinese. All are special encodings that will be difficult to view or understand unless your computer recognizes the code page being used.
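
A minimal sketch shows the doubling in action. The Japanese word below (a made-up example) takes two bytes per character in Shift-JIS, while plain ASCII still takes one byte per character:

    # Compare character counts with byte counts in a double-byte code page.
    ascii_text = "invoice"
    japanese_text = "請求書"  # "invoice" (seikyusho) in Japanese

    print(len(ascii_text), len(ascii_text.encode("shift_jis")))        # 7 characters -> 7 bytes
    print(len(japanese_text), len(japanese_text.encode("shift_jis")))  # 3 characters -> 6 bytes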

As a quick example, here is a text page that is not being rendered properly. You can see some English here but also boxes and other strange characters.

Viewing a text page using Internet Explorer's auto-select encoding function.

Clearly, we have chosen the wrong encoding. Actually, the computer chose the wrong encoding: in this case, Internet Explorer’s “Auto-Select” function failed to detect the intended encoding.

Here is how the same page looked after changing the browser encoding setting to Chinese Traditional (Big 5). In IE, you can do this by right-clicking the page and choosing the encoding option from the ones listed.

The same page viewed using Chinese Traditional (Big 5) encoding.

As you can see, the boxes and question marks are gone. Instead you see well-formed Chinese characters, in exactly the same form as when they were created.

All this special encoding worked well up to a point. But in an age of globalization, these multiple codes created a sort-of technological Babel, confounding our ability to easily share and process data across borders. On Monday, in part two of this post, we will discuss how this problem was addressed through the development of Unicode and what it all means for e-discovery.

Continue to part two.

Catalyst’s New Tokyo Data Center Will Serve Clients Throughout Asia

Catalyst CEO John Tredennick at the Tokyo data center.

As a news release today announced, Catalyst has opened a full-featured, high-security data center in Japan. The center, housed in the Equinix facility in the Shinagawa ward of central Tokyo, is designed to provide a safe haven for clients in Japan and throughout the Asia-Pacific region.

With the leading technology for multi-language search and review, Catalyst has long supported international law firms and multinational corporations in matters involving Asian-language data. Last year, Catalyst formally launched its Catalyst Asia division and opened offices in Hong Kong and Tokyo.

Catalyst has also established partnerships with a number of litigation support and data forensics companies operating throughout Asia. They include Ji2 Inc., with an office in Tokyo; LECG, with offices in Shanghai and Hong Kong; I-Analysis Pte Ltd., with an office in Singapore; D3 Forensics Ltd., with an office in Hong Kong; and Redeye Forensics, with an office in Seoul, South Korea.

As with Catalyst’s U.S.-based data centers, the Tokyo facility is highly secure and is capable of powering the heart of the litigation lifecycle, from initial processing through search, review, production and trial. Catalyst was one of the first e-discovery companies to offer true multi-language capabilities and is the only system to use full tokenization to allow language and locale selection in processing and search.

Read the full announcement: Catalyst Opens Tokyo Data Center to Serve Clients in Asia.

Is Your Company Ready for the UK Bribery Act?

U.S. corporations already wary of the Foreign Corrupt Practices Act will soon have an even stricter law to contend with. In April, the UK’s new Bribery Act takes effect. Some lawyers are calling it the toughest anti-corruption law in the world. And it applies to any corporation that conducts any business in the UK, regardless of where it is incorporated.

“The UK will reinforce its reputation as one of the least corrupt countries in the world, when the Bribery Act comes into force in April 2011,” says a Ministry of Justice announcement. “The Act will ensure the UK is at the forefront of the battle against bribery and pave the way for fairer practice by encouraging businesses to adopt anti-bribery safeguards.”

The act creates a new corporate offense of failing to prevent bribery by persons working on a company’s behalf. Companies can avoid conviction if they can show that they had adequate procedures in place to prevent bribery.

It also makes it a criminal offense for anyone to give, promise or offer a bribe, or to request, agree to receive or accept a bribe, whether within the UK or in a foreign country. The measure covers bribery of a foreign official.

There is no limit on the fine that can be imposed on those who violate the act. Violators also are subject to up to 10 years of imprisonment.


Setting Up Review Workflows for Multi-Language Documents

The world is getting smaller. For large corporations, it is virtually certain that their operations span multiple countries. But it is no longer just large corporations that operate globally. These days, even small- and mid-sized businesses are likely to have international components.

When a business is international, then any legal matters involving that business are also likely to be international in scope. In the context of litigation or a government investigation, that means the matter is likely to involve documents in more than one language. Often, such cases will involve collections of documents in a number of different languages – or even single documents containing multiple languages.

In other words, multi-language documents are a fact of e-discovery life these days. For e-discovery professionals, processing and review of multi-language collections raise a number of issues. In this post, I want to talk about one – review workflow.

Language Identification

Any successful multi-language review begins with computerized language identification. While most platforms support language identification, they vary greatly in how well they do it. Language identification uses built-in dictionaries to identify the primary (and sometimes secondary) language present in a document. This information can be used to route documents to the appropriate reviewer or to flag documents that need to be sent out for translation before they can be reviewed.

In the case of Chinese, Japanese and Korean (CJK) documents, language identification is less precise than it is with Western character sets. Frequently, document headers, email formatting and email signatures contain CJK text while the substantive portion of the record is in a Western language. That small amount of CJK text can cause the entire document to be coded as CJK. To navigate this problem, use a search tool that can both tokenize and count Western versus CJK words within a given document. Those counts can help establish a baseline for determining the true language of a document.
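
One rough way to establish that baseline, sketched here in Python and counting characters rather than words for simplicity, is to compare the amount of CJK text to the amount of Western text in each document:

    import re

    # Very rough character classes: CJK ideographs, kana and hangul versus Latin letters.
    CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")
    LATIN = re.compile(r"[A-Za-z]")

    def cjk_ratio(text):
        """Fraction of letter-like characters that are CJK. A low ratio suggests the
        document is substantively Western despite stray CJK headers or signatures."""
        cjk = len(CJK.findall(text))
        latin = len(LATIN.findall(text))
        total = cjk + latin
        return cjk / total if total else 0.0

    sample = "Re: 見積書\nPlease see the attached quotation and confirm the totals."
    print(round(cjk_ratio(sample), 2))  # a small fraction, so treat the document as English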

Language Specific Batching

With the limited number of foreign language reviewers and the often high cost associated with obtaining their services, it is important to have a clear system for assigning foreign language documents.

While most foreign language reviewers can review both their native language and English, you don’t want them wasting their time on documents that 90 percent of your other reviewers can read. English documents, documents of unknown language and documents with no text should go to your English reviewers. Only documents containing a non-English language should go to your foreign language reviewers.

Keep in mind, though, that when reviewing by document families, if any member of a family contains a foreign language, the entire family should go to your foreign language reviewer.
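
Here is a simple sketch of that batching rule, assuming each document carries a detected language and a family identifier; the field names and queue names are made up for illustration:

    from collections import defaultdict

    # Hypothetical documents: a detected language code and a family identifier.
    documents = [
        {"id": 1, "family": "A", "language": "en"},
        {"id": 2, "family": "A", "language": "ja"},       # one Japanese attachment in family A
        {"id": 3, "family": "B", "language": "en"},
        {"id": 4, "family": "C", "language": "unknown"},
    ]

    ENGLISH_QUEUE_LANGUAGES = {"en", "unknown", "none"}

    # Group documents by family, then route each whole family based on its members.
    families = defaultdict(list)
    for doc in documents:
        families[doc["family"]].append(doc)

    for family, docs in families.items():
        if any(d["language"] not in ENGLISH_QUEUE_LANGUAGES for d in docs):
            queue = "foreign_language_review"
        else:
            queue = "english_review"
        print(family, "->", queue)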

Flexible Workflows

No language identification is perfect. Inevitably reviewers are going to come across documents they can’t read. That’s why it is important to have a flexible workflow when setting up your review.

Take advantage of a rule-based review platform to re-route documents to the appropriate reviewer. If an English reviewer comes across a Russian document, the platform should make it possible to reassign that document so it isn’t lost in the shuffle. If no reviewer with that language proficiency is available, incorporate a system in which a reviewer can tag the document for translation.