Extract from Ella Sherman’s article, “Frontier AI Models Struggle With Complex Legal Tasks, Benchmarking Study Finds.”
Percipient’s latest benchmark evaluated gen AI models from Anthropic, OpenAI and other AI developers on how well they complete legal work in areas such as insurance coverage and document review.
What You Need to Know
- Although most of the models scored high in simpler task categories, they struggled to score above an average of 84 out of 100 points in the insurance coverage category.
- They collectively struggled to score above an average of 62 points in the employment category.
- Percipient founder Chad Main told Law.com that these lower scores show that foundational models are not quite ready to consistently perform well on certain subject-specific legal tasks.
Many frontier generative artificial intelligence models perform well on simple legal tasks like document review, but most struggle with more niche tasks, according to the “How Frontier AI Models Perform on Real Legal Work” report released Monday by managed review provider Percipient.
The report evaluated 16 models from Anthropic, DeepSeek, OpenAI, Google, Moonshot AI and xAI on four legal task types across litigation, transactional, employment and insurance practice areas, with each task scored out of 100 points. It was loosely modeled on OpenAI’s GDPval evaluation, which measures AI performance on realistic tasks.
Model outputs were graded by experienced legal professionals.