David Owen: U.S. and EU Regulators on Diverging Paths Over AI Training Data

Extract from David Owen’s article “U.S. and EU Regulators on Diverging Paths Over AI Training Data”

Many AI models have raised concerns about bias and privacy. The EU enforces strict transparency regulations, while the US favors a more open, less restrictive approach, a contrast that highlights their differing views on AI regulation. “As a result, there is now a rapidly expanding demand and market for usable AI training data and for innovative ways to capture more data and refine it to new applications. While the awesome size and diversity of data available to the public offer enormous potential and opportunity, the indiscriminate gathering and assimilation of data carries a variety of risks and policy concerns.”

Generative artificial intelligence (AI) has emerged as one of the most potentially transformative technological innovations of our time, and a race is on among governments and tech companies around the world to harness and control this fast-developing and disruptive technology.

While most users of ChatGPT likely never consider the amount of training data (the dataset that is used to teach a model how to perform a task) that was assimilated in order to generate useful content in response to their prompts, it is an immense volume of material.

The training data used by GPT-4, OpenAI’s latest model, reportedly includes an incredible 1 petabyte of data, the equivalent of 1 million gigabytes, or roughly 22 times the Library of Congress’s entire book collection.

ACEDS