Using Catalyst CR to Search Japanese
Written by Jim Eidelman   

Catalyst CR is built on the FAST search engine. FAST is a highly sophisticated, grid-based search engine developed by a Norwegian public company that has since been acquired by Microsoft.

Aside from speed and scalability, one of the most attractive features of the FAST search engine is that it supports search in more than 70 languages. Most important for purposes of this paper, FAST can search the CJK languages: Chinese, Japanese and Korean.

CJK languages are more difficult to index than Western languages because they don't use punctuation or spaces to define word boundaries. Word boundaries are critical for computerized search because search engines create large indexes based on the words in the underlying files being searched.
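To make the word-boundary problem concrete, here is a minimal Python sketch. It is an illustration only, not a description of how FAST segments text: the function names and sample sentences are made up for this example, and the character-bigram fallback is simply one generic technique sometimes used for unsegmented text.

```python
def whitespace_tokens(text):
    """Split on whitespace, the usual first step for Western-language text."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens: a generic fallback for text
    without spaces (an assumption for illustration, not FAST's method)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "search engines index words"
japanese = "検索エンジンは単語を索引する"  # roughly the same sentence, no spaces

print(whitespace_tokens(english))    # four separate words
print(whitespace_tokens(japanese))   # the whole sentence comes back as one token
print(char_bigrams(japanese)[:3])    # first few two-character tokens
```

Splitting on spaces yields usable words for the English sentence but returns the Japanese sentence as a single unbroken token, which is why a CJK-aware engine needs some other way to decide where words begin and end.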

When users run searches, the search engine consults the index to find where words occur rather than running through each individual document. This practice of creating and using a "word" index is the reason modern search engines can bring back results quickly, even for searches involving millions of pages of documents. It is the method followed by Google, Westlaw and Lexis, among many others.
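The idea of a word index can be sketched in a few lines of Python. This is a toy version under simplifying assumptions (space-separated text, lowercase matching, every query word required), with hypothetical document contents and function names; production engines such as FAST are far more elaborate.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every word in the query.
    Only the index is consulted; the documents are never re-read."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the contract was signed in Tokyo",
    2: "the contract was terminated",
    3: "meeting notes from Tokyo",
}
index = build_index(docs)
print(search(index, "contract tokyo"))
```

The index is built once, up front. Each search then costs a few dictionary lookups and set intersections regardless of how long the documents are, which is the source of the speed described above. It also shows why word boundaries matter: the index is only as good as the step that splits text into words.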
