Extract from Isha Marathe’s article “Generative AI is Trained on Online E-discovery Resources. Here’s What That Means for E-discovery”
Even for the experts, there is much left to learn about generative artificial intelligence, primarily because it is so new and so fast-evolving.
However, there is at least one point of unanimous agreement across all sectors when it comes to the technology: What comes out depends strongly on what goes in.
And while the black box where the magic happens (i.e., the machine’s reasoning and decision-making process) has yet to be cracked, more information about the “what goes in,” or the input, is beginning to emerge.
In the e-discovery sector specifically, Rob Robinson, managing director of ComplexDiscovery, outlined a list of 55 e-discovery-centric resource domains that are included in Google’s Colossal Clean Crawled Corpus (C4) dataset. The C4 dataset is a massive collection of web pages crawled (automatically fetched and indexed) by the Common Crawl project, and it serves as a vital bedrock of information for training large language models (LLMs) such as OpenAI’s GPT models, Microsoft’s Bing chatbot and Google’s Bard, to name a few.
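For readers curious how such an inclusion check might be done in practice, the sketch below streams the publicly hosted allenai/c4 mirror of the dataset on Hugging Face and counts pages from a given set of domains. This is a minimal illustration under stated assumptions: the two domains listed are hypothetical placeholders, not Robinson’s actual list of 55, and streaming is used because the full English split is far too large (roughly 750 GB of text, per the original T5 paper) to download for a spot check.

```python
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

# Hypothetical e-discovery-centric domains to look for (placeholders only;
# not the actual list of 55 domains outlined by Rob Robinson).
TARGET_DOMAINS = {
    "complexdiscovery.com",
    "ediscoverytoday.com",
}

# Stream the corpus rather than downloading it in full; the "en" split of C4
# is roughly 750 GB of text.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

hits = []
for i, record in enumerate(c4):
    # Each C4 record carries "text", "timestamp", and the source "url".
    domain = urlparse(record["url"]).netloc.removeprefix("www.")
    if domain in TARGET_DOMAINS:
        hits.append(record["url"])
    if i >= 100_000:  # cap the scan so the example finishes quickly
        break

print(f"Found {len(hits)} pages from target domains in the first 100k records")
```

A scan like this only samples the corpus; establishing that a domain appears anywhere in C4 requires checking every record (or a precomputed URL index), which is presumably closer to what Robinson’s analysis involved.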