Extract from Laer AI’s article “Why Confidence Scoring with LLMs is Dangerous”
Before relying on confidence scoring of LLMs in a document review setting, there are certain things you should know.
Let’s return to confidence assessments from LLMs and examine the consequences of relying on them. Recall from our post on Assessments how essential the scoring of predictions is, and that what matters most is not the scores themselves but the ranking those scores induce across all predictions. We use scores to build rankings; we’ve been doing that since the early days of TAR 1.0.
Once our model (whatever kind: TAR 1.0, TAR 2.0/CAL, or an LLM) gives us a score for each example, we rank the examples and draw a cut-off line. That line separates what the model would call Responsive from Not Responsive. The resulting separation will inevitably contain errors, whether false positives or false negatives.
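The score-rank-cutoff workflow described above can be sketched in a few lines of Python. Note that the document IDs, scores, and the cut-off value here are all hypothetical, chosen purely for illustration; this is not Laer AI's actual pipeline:

```python
# Hypothetical (doc_id, model_score) pairs from some review model
scored_docs = [("doc_a", 0.91), ("doc_b", 0.34), ("doc_c", 0.77), ("doc_d", 0.12)]

# Rank by score, highest first -- the ranking matters more than the raw scores
ranking = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)

CUTOFF = 0.5  # where we "draw the line" (an assumed value for illustration)

responsive = [doc for doc, score in ranking if score >= CUTOFF]
not_responsive = [doc for doc, score in ranking if score < CUTOFF]

print(responsive)      # documents above the line
print(not_responsive)  # documents below the line
```

Any document above the line is treated as Responsive and any below as Not Responsive; moving the cut-off trades false positives against false negatives.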