Extract from David Kalat’s article “Nervous System: From Gibberish to Unicode”
With the aggressive pace of technological change and the onslaught of news regarding data breaches, cyber-attacks, and technological threats to privacy and security, it is easy to assume these are fundamentally new threats. The pace of technological change is slower than it feels, and many seemingly new categories of threats have been with us longer than we remember. Nervous System is a monthly series that approaches issues of data privacy and cyber security from the context of history—to look to the past for clues about how to interpret the present and prepare for the future.
In an early landmark of eDiscovery case law, CP Solutions PTE, Ltd. v. General Elec. Co. (D. Conn. Feb. 6, 2006), the plaintiff objected to a variety of alleged defects in the defendants’ production, including what were supposedly thousands of pages of “gibberish.” The court ruled that, to the extent that the underlying documents were created or received by any of the defendants in a readable format, they must be produced for the plaintiff in a readable, usable format.
From the standpoint of jurisprudence this seems an eminently reasonable conclusion. More interesting, though, is the technical question of why the gibberish appeared at all. While many system files and non-text-based electronic documents routinely appear as “gibberish” when rendered as printed text, an entire category of genuine written text-based communication can, in certain conditions, end up appearing as a nonsensical jumble of odd symbols and unreadable characters. To understand why this happens, and to solve for it, requires looking under the hood of how binary data handles text in the first place.
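One common way that readable text turns into a jumble of odd symbols is an encoding mismatch: the bytes are written under one character encoding and read back under another. The short Python sketch below illustrates that general mechanism only. It is an assumption for illustration, not a reconstruction of what happened to the documents in CP Solutions, and the helper name misread and the sample word are hypothetical.

```python
def misread(text: str, written_as: str = "utf-8", read_as: str = "latin-1") -> str:
    """Encode text under one encoding, then decode the same bytes under another."""
    return text.encode(written_as).decode(read_as)


original = "café"

# The bytes actually stored on disk when the text is saved as UTF-8.
# The accented letter takes two bytes (c3 a9) rather than one.
stored_bytes = original.encode("utf-8")
print(stored_bytes.hex(" "))  # 63 61 66 c3 a9

# Reading those bytes as Latin-1 treats each byte as its own character,
# so the two-byte letter becomes two unrelated symbols.
garbled = misread(original)
print(garbled)  # cafÃ©

# Because the bytes themselves were never altered, reversing the wrong
# decoding step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

The underlying bytes are identical in every case; only the interpretation changes. That is why this kind of gibberish is often recoverable when the original encoding is known, and why plain unaccented English text tends to survive the same mistake untouched.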