Recently, we had an interesting problem to solve for a client of ours: In a case with some 922,000 documents produced by the opposing party, there were roughly 720,000 documents that had been scanned and OCR’d. These included a very large number of handwritten documents. Most of the handwritten documents were significant, with pages of notes written on legal pads and memo forms.
Our client asked for our help in identifying the handwritten documents from among all those that had been scanned. To help us get started, our client provided us with a folder of about 4,000 examples.
On initial inspection, we could see that the OCR’d text for these documents was all gibberish and contained lots of white space. We knew that the documents were mostly English, with some Spanish. Fortunately, none were Japanese or Chinese, so we didn’t need to worry about filtering out text in those languages.
To find the handwritten documents among the 720,000 scanned, non-ESI documents that had been produced, here is what we did:
Using Language Detection
Catalyst’s software has automatic language detection that is run in the processing pipeline when the document is ingested. If the software does not see enough words to identify the language, it classifies the primary language as “Unknown.”
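Catalyst’s detector is proprietary, but to make the idea concrete, here is a minimal sketch of that classification logic using the open-source langdetect library as a stand-in; the MIN_WORDS threshold is a hypothetical proxy for “enough words to identify the language.”

```python
# Minimal sketch only -- not Catalyst's implementation. Uses the
# open-source langdetect library (pip install langdetect) as a stand-in.
from langdetect import detect, DetectorFactory, LangDetectException

DetectorFactory.seed = 0    # make detection deterministic across runs

MIN_WORDS = 5               # hypothetical "enough words" threshold

def primary_language(text: str) -> str:
    """Return a language code such as 'en' or 'es', or 'Unknown' when
    there is too little recognizable text -- the typical result of
    OCR'ing handwriting, which yields gibberish and white space."""
    words = [w for w in text.split() if w.isalpha()]
    if len(words) < MIN_WORDS:
        return "Unknown"
    try:
        return detect(" ".join(words))
    except LangDetectException:
        return "Unknown"    # no detectable language features
```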
After running language detection and sampling the results, we confirmed that the documents that were classified as English or Spanish were not handwritten, and that the handwritten documents had been properly classified as “language = Unknown.”
In all, there were 169,000 documents that were classified as “language = Unknown.” In addition to handwritten documents, these documents also included file folders, documents that were not scanned well (sideways pages, gray on gray), graphs with lines, and some documents that did have typewritten text but for some reason were not identified as English.
Narrowing the Results
We could have stopped there, but we decided to see whether we could distinguish the handwritten documents from the bad scans and typewritten documents classified as “Unknown.” To that end, we ran two additional types of analysis.
The first looked at file size and character count to see whether the blank pages and file folders could be separated out on the theory that they contained almost no text characters.

This turned out to be a fruitless exercise because many of the handwritten scans also had tiny character counts and minimal text file sizes. Fortunately, it takes a reviewer just one click on “Next” to move past a blank page, so the blanks didn’t present a significant problem.
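For anyone trying a similar cut, the check itself is simple. Here is a sketch assuming the OCR output sits in one .txt file per document; the character threshold is hypothetical:

```python
import os

CHAR_THRESHOLD = 25  # hypothetical cutoff for "almost no text"

def is_near_empty(text_path: str) -> bool:
    """Flag documents whose OCR text is nearly empty: candidate blank
    pages and file folders. As noted above, many handwritten scans fall
    under the same cutoff, which is why this check alone could not
    separate the blanks from the handwriting."""
    if os.path.getsize(text_path) == 0:
        return True
    with open(text_path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    # Count only non-white-space characters.
    return sum(1 for ch in text if not ch.isspace()) < CHAR_THRESHOLD
```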
The second step was to see if we could filter out the documents that actually had significant text in them. It took a few iterations to get right, but we came up with a method that worked.
We imported a list of the 300 most common words in the English language into Power Search, Catalyst’s batch-search utility, and ran them against all of the documents where the language was identified as Unknown. In the final iteration, we removed from the list the words most likely to produce false hits in the handwritten documents (if, it, so, no, I, as, etc.), and we also eliminated a few words that appeared as typed text on handwritten memo forms (from, date, etc.).
(We removed the short common words because of false hits, not because of any search limitation: we don’t have Craig Ball’s “to be or not to be” issue and can search for all words, even those that are “stop words” in many systems. See Craig’s post, Come and Take It: Free Corpus to Test E-Discovery Tools, and also the comment on the post by Catalyst CEO John Tredennick.)
About 85,000 documents, roughly half of those classified as Unknown, did not hit on any of the search terms; these were clearly handwritten documents, blanks or bad scans. We then took the documents that did hit on the common terms and foldered them by hit count: one hit, two hits, three hits, and so on. The documents that hit on these common words were typically not the blanks, so the blanks all fell into the zero-hits group.
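Power Search is Catalyst’s own utility, so for illustration here is a rough Python equivalent of that batch search and foldering step. The word list is truncated to a few entries, and the pruning described above (if, it, from, date, etc.) is assumed to have already been applied:

```python
import re
from collections import defaultdict
from pathlib import Path

# Truncated stand-in for the ~300-word list, already pruned of likely
# false hits (if, it, so, no, I, as) and memo-form words (from, date).
COMMON_WORDS = ["the", "and", "that", "have", "with", "this", "would"]

PATTERNS = [re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
            for word in COMMON_WORDS]

def folder_by_hit_count(text_dir):
    """Group documents by how many distinct common words hit them.
    The zero-hit folder holds the handwritten, blank and bad-scan
    documents; typewritten text climbs into the higher folders."""
    folders = defaultdict(list)
    for path in Path(text_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = sum(1 for pattern in PATTERNS if pattern.search(text))
        folders[hits].append(path.name)
    return folders
```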
Sidebar note: We know of at least one vendor that has announced software, ZyLab’s Visual Classification, that can identify and classify handwriting from the graphics files, without any text. This approach sounds very interesting.
As it turned out, the ones with just a few hits were either handwritten or bad scans; you didn’t start seeing typewritten pages until the number of hits increased significantly.
Of course, we have seen other cases in which problems with an opposing party’s production caused many documents to contain gibberish, but that wasn’t the situation here. So the method of searching for common words did let us pull out the typewritten OCR scans whose language had been misclassified as Unknown.
The Bottom Line
A good way to identify handwritten documents in an opposing production that have been OCR’d into gibberish text files is to look for documents in which (1) language detection has not identified a primary language and (2) a search of the most common English words either does not hit on the text at all or hits on only a small number of words.
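Reduced to code, the whole test is a two-part predicate. Here is a sketch reusing the pieces from the sketches above, with a hypothetical cutoff for “a small number” of hits:

```python
FEW_HITS = 3  # hypothetical cutoff; typewritten pages did not appear
              # until hit counts rose well above a handful

def likely_handwritten(language: str, common_word_hits: int) -> bool:
    """Candidate handwriting: (1) no primary language detected and
    (2) few or no common English words hit the OCR text."""
    return language == "Unknown" and common_word_hits <= FEW_HITS
```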
For a related discussion involving OCR’d handwritten documents, see John Tredennick’s blog post, Does Bad OCR Make for Good TAR?