Extract from John Tredennick and David Sannar’s article “Using TAR for Asian Language Discovery”
In the early days, many questioned whether technology assisted review (TAR) would work for non-English documents. There were several reasons for the doubt, but one fear was that TAR “understood” only the English language.
Ironically, that was true, in a way, in the early days of e-discovery. At the time, most litigation support systems were built for ASCII text. The indexing and search software didn’t understand Asian character combinations and thus couldn’t recognize which characters should be grouped together for proper indexing. English (and most other Western languages) puts spaces between words, but many Asian languages have no such obvious markers to show which characters go together to form useful units of meaning (the equivalent of English words).
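To make the word-boundary problem concrete, here is a minimal Python sketch. It is an illustration only, not the indexing method of any particular litigation support system: the function names and sample strings are hypothetical, and overlapping character bigrams are used here as one simple, commonly described workaround for text without spaces, not as the approach the authors describe.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, as a simple space-delimited indexer might."""
    return text.split()


def char_bigrams(text: str) -> list[str]:
    """Return overlapping two-character units, ignoring whitespace.

    Assumption: bigrams stand in for a real word segmenter, which would
    normally rely on a dictionary or a statistical model.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


english = "we index documents"   # hypothetical sample text
chinese = "我们索引文件"            # roughly the same phrase, written without spaces

print(whitespace_tokens(english))  # ['we', 'index', 'documents']
print(whitespace_tokens(chinese))  # ['我们索引文件']
print(char_bigrams(chinese))       # ['我们', '们索', '索引', '引文', '文件']
```

Splitting on spaces yields three searchable words for the English sample but treats the entire Chinese sample as a single unit, so a search for 文件 (“documents”) would find nothing in a space-based index. The bigram version does produce 文件 as an indexable unit, at the cost of also producing combinations such as 们索 that carry no meaning.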