Everything’s New—Except Production Formats

February 22, 2024/ACEDS Blog, Data and Technology, Document Review

Historically, productions have come in the form of single-page images along with load files containing select metadata (and occasionally native documents for items that couldn’t be imaged). The images typically have Bates labels and confidentiality designations burned in and sometimes contain redactions.

Changes

Several years ago, we switched from 300 dpi TIFF images with CCITT Group 4 black and white compression (which meant no grey!) to allowing color in the form of color TIFFs or JPGs. The move from black and white to color had some TV-in-the-sixties vibes, but it was a great improvement. Even so, some eDiscovery platforms still can’t accept color TIFF files or produce color documents.

Litigators frequently convert TIFF files to PDF because the format supports raster and vector images (or combinations of the two), embedded text, and many other content types. Yet most eDiscovery protocols don’t allow PDFs.

This could be because, for a long time, some platforms produced “near native” documents by converting them to PDF. The text in the document was in vector form, but metadata was lost by converting a non-paginated email to a paginated PDF with links, etc. This might be why PDFs weren’t quickly adopted as the preferred production format. And, as computer scientists, we’re still not entirely sure what “near-native” means, so the hesitation is understandable.

What Do Modern Productions Look Like?

The ideal format to receive images is a multi-page PDF where a single document is represented by a single PDF. It is important to note that the PDF is a document and not a portfolio. If a “document” is an email and an attachment, the email should be a separate file from the attachment. The PDF should be named by its beginning Bates label (often called beginning production number).
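As a rough illustration of the naming convention, the sketch below derives a PDF file name and an ending Bates number from a beginning Bates label and a page count. The function name `bates_range` and the assumption that a Bates label is a prefix followed by a zero-padded number are ours for illustration; actual protocols may number differently.

```python
import re

def bates_range(beg_bates, page_count):
    """Return (pdf_file_name, end_bates) for one multi-page document.

    Assumes a Bates label is an arbitrary prefix followed by a
    zero-padded number, with one Bates number per page.
    """
    match = re.fullmatch(r"(.*?)(\d+)", beg_bates)
    if match is None or page_count < 1:
        raise ValueError("expected a label ending in digits and at least one page")
    prefix, digits = match.groups()
    # The last page carries the beginning number plus (pages - 1).
    end_number = int(digits) + page_count - 1
    # Keep the same zero padding as the beginning label.
    end_bates = f"{prefix}{end_number:0{len(digits)}d}"
    return f"{beg_bates}.pdf", end_bates

file_name, end_bates = bates_range("ABC0000123", 3)
print(file_name, end_bates)
```

Under this scheme an email and its attachment would each get their own call, and therefore their own file, which matches the one-document-per-PDF rule described above.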

Natives and Metadata

There should be between 50 and 60 extracted or derived metadata items when a data set is processed, but in our experience, productions may arrive with only six to twenty fields. It’s important to understand that some metadata exists externally (file, date, path), but most exists inside the native and can either be directly extracted or derived from that data (we wrote a separate blog post about this).

There are several reasons why more metadata is better. A good example involves (you guessed it) the increasing use of AI. One of the most useful metadata fields for modern AI workflows is a potential-privilege indicator calculated automatically from the domain names in an email’s addresses. If an email is between only the end client and the law firm, it’s probably privileged; if there is an additional email domain (like Gmail), it might be privileged. If the email includes the opposing party, it is most likely not privileged. Without the derived metadata of an email domain, this calculation can’t be made.
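The domain-based privilege rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the returned labels, and the example domains are hypothetical, and we read "between only the end client and the law firm" as requiring at least one address from each. A real workflow would feed this hint into attorney review rather than treat it as a decision.

```python
from email.utils import getaddresses

def email_domains(*header_values):
    """Collect the lowercase domains found in From/To/Cc/Bcc header values."""
    domains = set()
    for _name, address in getaddresses(list(header_values)):
        if "@" in address:
            domains.add(address.rsplit("@", 1)[1].lower())
    return domains

def privilege_hint(domains, client_domains, firm_domains, opposing_domains):
    """Return a coarse privilege hint from the set of participant domains."""
    if domains & opposing_domains:
        # Any opposing-party participant: most likely not privileged.
        return "likely not privileged"
    only_client_and_firm = domains <= (client_domains | firm_domains)
    if only_client_and_firm and domains & client_domains and domains & firm_domains:
        # Assumption: both sides of the attorney-client pair must be present.
        return "probably privileged"
    # Extra domains (for example a personal Gmail account) leave it open.
    return "possibly privileged"

# Hypothetical domain lists for illustration only.
CLIENT = {"client.example"}
FIRM = {"lawfirm.example"}
OPPOSING = {"opposing.example"}

found = email_domains("Jim <jim@client.example>", "counsel@lawfirm.example, jim.home@gmail.com")
print(privilege_hint(found, CLIENT, FIRM, OPPOSING))
```

None of this is possible if the production arrives without the sender and recipient fields from which the domains are derived.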

Production requests should include 60-70 metadata fields to get the maximum use from a data set, and if opposing counsel can’t produce those, it might be because they’re using a legacy tool.

Historically, natives have been produced only for documents that are difficult to image. This should change soon, as producing natives along with their images should become the preferred format. Productions should contain natives in addition to Bates-stamped images because many legacy processing platforms still can’t extract all the relevant internal metadata. Often, these processing tools mangle the external metadata as well. Producing natives allows a receiving party to use a modern processing tool to extract and derive all the metadata that our hungry AI tools can consume.

Redaction

Some legacy tools that produce redacted documents don’t produce redacted text; they produce no text at all for the redacted document, so the receiving party has to OCR that document in order to search it. QC protocols should catch missing document text or metadata and, indeed, some modern tools enhance inbound productions and detect these anomalies and errors automatically. Production protocols should identify the redacted documents in the metadata file and include the information that would be provided in a privilege log (i.e., the reason for the redaction).
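A minimal sketch of such a QC pass over an inbound production might look like the following. The field names (`BEGBATES`, `CUSTODIAN`, `TEXT`, `REDACTED`, `REDACTION_REASON`) are hypothetical, and the sketch assumes each load file row has already been read into a dictionary with the document text loaded alongside it; real load files vary and often point to text files by path instead.

```python
def _value(row, field):
    """Return a stripped string for a field, treating missing/None as empty."""
    return (row.get(field) or "").strip()

def qc_production(records, required_fields=("BEGBATES", "CUSTODIAN")):
    """Return (begbates, problem) tuples for rows that fail basic checks."""
    problems = []
    for row in records:
        bates = _value(row, "BEGBATES") or "<no Bates>"
        # Flag any required metadata field that is absent or blank.
        for field in required_fields:
            if not _value(row, field):
                problems.append((bates, f"missing {field}"))
        # A document with no text cannot be searched until it is OCRed.
        if not _value(row, "TEXT"):
            problems.append((bates, "no document text (may need OCR)"))
        # Redacted documents should state why they were redacted.
        if _value(row, "REDACTED").upper() == "Y" and not _value(row, "REDACTION_REASON"):
            problems.append((bates, "redacted without a stated reason"))
    return problems

sample = [
    {"BEGBATES": "ABC0000001", "CUSTODIAN": "Jim", "TEXT": "Quarterly numbers attached.",
     "REDACTED": "N", "REDACTION_REASON": ""},
    {"BEGBATES": "ABC0000002", "CUSTODIAN": "", "TEXT": "",
     "REDACTED": "Y", "REDACTION_REASON": ""},
]
for bates, problem in qc_production(sample):
    print(bates, problem)
```

The second sample row trips all three checks, which is the pattern described above: a redacted document delivered with no text and no reason.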

Perhaps the fear of producing natives stems from redactions: if an attachment is produced in redacted form but its parent email is produced in native form, the attachment has also been provided unredacted. The most common native redactions we see are in Excel documents where certain words or phrases have been removed. In some cases, this redaction breaks the functionality of the spreadsheet, but commonly these spreadsheets hold client lists or financial numbers rather than formulas that need to calculate.

Delivery of Productions

Occasionally, productions are still shipped via overnight delivery on encrypted hard drives, sometimes with complicated PIN patterns. Sending and receiving produced data via download links is far more efficient. These links are password protected, usually in two ways: a password to get to the link and a password to unpack the files once in the system. With this method of delivery, it is best to send the production link directly to whoever needs to download it, rather than having the client download the production and then forward the downloaded file. Things can “break” in the middle, and it’s better to get the production from its source.

Rolling Productions

The allowance for rolling productions is a big step, but these often miss one critical point when the producing party uses deduplication: earlier productions need to be updated with information about documents that were deduplicated after the primary copy was produced.

For example, suppose an email is produced under the custodian Jim in the first production, and Jane’s mailbox later turns out to contain the same emails that Jim and Jane exchanged, so Jane’s copies are de-duplicated. When Jane’s documents are produced, Jim’s production should be updated with an overlay covering all the documents that are not being re-produced for Jane. The overlay should at least record that Jane had a copy of something produced with Jim’s documents, as well as the path to the document in Jane’s custodian data.
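The Jim and Jane scenario can be sketched as a lookup against an index of what has already been produced. Everything here is an assumption for illustration: the function name, the idea of keying documents by a content hash, and the overlay field names `ALL_CUSTODIANS` and `DUPLICATE_PATHS` are hypothetical, not fields taken from any particular protocol.

```python
def split_rolling_production(produced_index, incoming_docs):
    """Separate a custodian's documents into overlay rows and new documents.

    produced_index maps a content hash to the already-produced primary
    copy, e.g. {"begbates": "ABC0000001", "custodian": "Jim"}.
    incoming_docs is a list of {"hash", "custodian", "path"} dictionaries.
    """
    overlay_rows = []
    to_produce = []
    for doc in incoming_docs:
        primary = produced_index.get(doc["hash"])
        if primary is None:
            # Not seen before: produce it in this volume as usual.
            to_produce.append(doc)
            continue
        # Duplicate of an earlier document: do not re-produce it, but
        # tell the receiving party who else had it and where it lived.
        overlay_rows.append({
            "BEGBATES": primary["begbates"],
            "ALL_CUSTODIANS": "; ".join([primary["custodian"], doc["custodian"]]),
            "DUPLICATE_PATHS": doc["path"],
        })
    return overlay_rows, to_produce

produced = {"hash-1": {"begbates": "ABC0000001", "custodian": "Jim"}}
janes_docs = [
    {"hash": "hash-1", "custodian": "Jane", "path": "Jane/Inbox/re_budget.msg"},
    {"hash": "hash-2", "custodian": "Jane", "path": "Jane/Sent/new_topic.msg"},
]
overlay, new_docs = split_rolling_production(produced, janes_docs)
print(overlay)
print(new_docs)
```

The overlay rows are keyed by the Bates number from Jim’s production, so the receiving party can apply them to records it has already loaded.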

Conclusion

eDiscovery productions will likely never stray far from .DAT files, partly because they are simple and understandable: they are essentially spreadsheets with a variety of delimiters. Moving away from single-page images and giving up the resistance to producing data in native form are the next evolutionary steps in eDiscovery production.

Dr. Gavin Manes

CEO at Avansic

Dr. Gavin Manes is a nationally recognized eDiscovery and digital forensics expert. He founded Avansic in 2004 after completing his Doctorate in Computer Science from the University of Tulsa. At Avansic, Dr. Manes is committed to high-technology innovation, research, and mentorship, and has several patents pending. Avansic's scientific approach to eDiscovery and digital forensics stems from his academic experience.

Dr. Manes routinely serves as an expert witness including consulting with attorneys on data preservation issues. He contributes academic content to peer-reviewed journals and delivers classroom lectures. See his full CV at gavinmanes.com.

Dr. Manes has published over fifty papers on eDiscovery, digital forensics, and computer security, written countless blog posts, and given educational presentations to attorneys, executives, professors, law enforcement, and professional groups on topics from eDiscovery to cyber law. He’s briefed the White House, the Department of the Interior, the National Security Council, and the Pentagon on computer security and forensics issues.

At the University, Dr. Manes formed the Tulsa Digital Forensics Center, housing Cyber Crime Units from local, state, and federal law enforcement agencies. He’s a founder of the University of Tulsa’s Institute for Information Security, leading the creation of nationally recognized research efforts in digital forensics and telecommunications security.