Herbert Roitblat: Do Language Models Memorize?

Extract from Herbert Roitblat’s article “Do Language Models Memorize?”

The New York Times (NYT) has joined the list of plaintiffs complaining that large language models, particularly the models from OpenAI, have been stealing copyrighted material. To understand these complaints, it may be helpful to break them down into simpler legal and technical questions, taken roughly in the order in which they occur in the process.

  1. Did OpenAI access NYT articles to build its language model? (technical question)
  2. Is this access (if it occurred) a fair use of those articles? (legal question)
  3. Did OpenAI copy the articles into its models? (technical question)
  4. Does it matter whether the models generate content similar to their training data or retrieve that content? (legal question)

Background

Large language models are constructed of layered neural networks. Each layer consists of metaphorical neurons that accept input values and produce an output value that depends on those inputs. Each neuron in each layer sums the weighted inputs it receives from the previous layer and passes an output value to the next layer. The degree to which each neuron affects each neuron in the next layer is controlled by a weight parameter. GPT-4 is said to have about 1.76 trillion parameters.
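The weighted-sum computation described above can be illustrated with a few lines of code. The sketch below is purely illustrative: the layer sizes, random weights, and activation function are made up for the example, and real models such as GPT-4 use transformer layers with orders of magnitude more parameters.

```python
import numpy as np

def layer_forward(inputs: np.ndarray, weights: np.ndarray, biases: np.ndarray) -> np.ndarray:
    """One layer of 'metaphorical neurons': each neuron sums its weighted
    inputs from the previous layer and passes the result through a
    nonlinearity (illustrative ReLU here)."""
    weighted_sums = inputs @ weights + biases   # each neuron's weighted sum
    return np.maximum(weighted_sums, 0.0)       # output values sent to the next layer

# Example: 4 outputs from the previous layer feeding a layer of 3 neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=4)        # outputs of the previous layer
W = rng.normal(size=(4, 3))   # weight parameters: input i -> neuron j
b = np.zeros(3)               # bias parameters
print(layer_forward(x, W, b))
```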

The models are trained by exposing them to large amounts of text. The training process adjusts the parameters to improve the accuracy of the model’s word (technically “token”) predictions. Given a passage of text (called a “context,” or in production, a “prompt”), the model predicts the word/token that follows. The predicted word is added to the context and the process is repeated. Training continues until the predictions are sufficiently accurate.
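The predict-and-append loop described above can be sketched in a few lines. The `predict_next_token` function below is only a stand-in for a trained model (it returns a canned continuation), but the surrounding loop mirrors the process the paragraph describes: predict the next token, add it to the context, and repeat.

```python
def predict_next_token(context: list[str]) -> str:
    """Placeholder for a trained model. A real model assigns a probability to
    every token in its vocabulary given the context; this stub just returns a
    canned continuation so the loop is runnable."""
    continuation = {"fox": "jumps", "jumps": "over", "over": "the",
                    "the": "lazy", "lazy": "dog", "dog": "."}
    return continuation.get(context[-1], ".")

def generate(prompt: list[str], max_new_tokens: int = 6) -> list[str]:
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next_token(context)  # predict the token that follows
        context.append(token)                # add it to the context and repeat
    return context

print(" ".join(generate(["the", "quick", "brown", "fox"])))
```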

