Lex Hamilton
Transcript-faithful is not the same as clinically correct
A transcript-faithful clinical note can still be clinically wrong. Why that gap matters for AI governance, from Orinyx, an independent safety layer for clinical AI.
In August 2025, Abridge published a whitepaper on how it catches confabulations in AI-generated clinical notes. It is a careful document, more transparent than most of what vendors put out, and worth reading if you run clinical AI at a health system. It also draws a line that every CMIO evaluating ambient documentation should understand before signing a deal, because the line marks exactly where a vendor's safety work ends and your exposure begins.
Here is the line, as Abridge itself describes it. Their guardrail measures whether each claim in the note is supported by the transcript of the conversation. The transcript is the reference. A note that says a patient has diabetes counts as unsupported if the conversation does not support it, even when the patient genuinely has diabetes. Their guardrail is not checking the claim against the patient's chart, the FDA label, or clinical reality. It is checking the claim against the recording of the room.
That is a real and useful thing to measure. It catches a large class of errors: the model that writes Prozac even after the patient has corrected it to Lexapro, the misattributed quote, the dropped medication. Abridge reports its purpose-built model catches 97% of these against an internal benchmark, compared with 82% for an off-the-shelf model. Inside the task they defined, that is strong work.
But notice what the task cannot reach.
The error that originates in the room
Consider an emergency physician, mid-shift, who verbally states a dose that is wrong. Not a transcription error. The clinician said it, the scribe heard it correctly, and the note records it faithfully. By the transcript standard, that claim is directly supported. It passes every guardrail Abridge describes, because the guardrail's job is to be faithful to the conversation, and it was.
The note is transcript-faithful. It is also clinically incorrect.
This is not a flaw in Abridge's system. It is a property of what the system was built to do. A transcript-fidelity check cannot catch an error that lives in the transcript, any more than a faithful court reporter can catch a witness who misremembers a date. The reference material has no opinion on whether the dose contradicts the drug's labeling, whether the interaction is real, or whether the monitoring protocol applies. Those questions require a different reference: external clinical authority, not the conversation.
When the only thing checking a clinical claim is fidelity to the conversation that produced it, contradictions of the FDA label, reversed guidelines, and dose errors spoken aloud all flow straight through, correctly transcribed and unflagged.
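A toy sketch can make the two references concrete. It is not Abridge's method or anyone's production check: the claim type, both functions, the drug name drug_x, and the 10 mg limit are hypothetical placeholders invented for illustration, and the string match stands in for what would really be a far more careful comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoseClaim:
    drug: str
    dose_mg: float


# Hypothetical stand-in for external clinical authority.
# The drug name and the limit are placeholders, not real labeling data.
MAX_SINGLE_DOSE_MG = {"drug_x": 10.0}


def supported_by_transcript(claim: DoseClaim, transcript: str) -> bool:
    """Toy fidelity check: does the transcript mention this drug at this dose?"""
    text = transcript.lower()
    return claim.drug in text and f"{claim.dose_mg:g} mg" in text


def within_reference(claim: DoseClaim) -> bool:
    """Toy external check: is the dose at or below the reference maximum?"""
    limit = MAX_SINGLE_DOSE_MG.get(claim.drug)
    return limit is not None and claim.dose_mg <= limit


# The clinician misspeaks; the note records exactly what was said.
transcript = "Clinician: give drug_x 100 mg now."
claim = DoseClaim(drug="drug_x", dose_mg=100.0)

faithful = supported_by_transcript(claim, transcript)
correct = within_reference(claim)
print(f"transcript-faithful: {faithful}, within reference: {correct}")
```

The same claim passes the first check and fails the second. The two functions never disagree about what was said; they consult different references, which is why passing one says nothing about the other.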
Why the vendor's own benchmark cannot close this for you
Abridge is candid about the rest. The whitepaper states plainly that no guardrail is perfect and that clinician review remains essential. That honesty is to their credit. It is also the part procurement teams tend to skip past on the way to the 97% headline.
The 97% deserves a second look, not because it is wrong, but because of who produced it. Abridge defined the benchmark, curated the dataset, selected the single comparator, and scored the result. Every one of those choices was made by the party whose product was being measured. A self-administered benchmark measures what the builder chose to measure, under conditions the builder chose to set. It can be entirely accurate and still tell you nothing about how the system performs against claims the builder did not think to test, or against the standard the builder was not trying to meet.
This is the structural issue, and it is not specific to Abridge. When the organization generating the AI output is also the organization grading it, you do not have a check. You have a self-assessment. Self-assessments are useful. They are not the same as verification, and a hospital board, a malpractice carrier, or a regulator asking how you know your AI is safe is asking for the second thing.
What this means for the people evaluating these tools
None of this argues against ambient scribes. The burnout reduction is real, clinician retention with these tools is high, and the documentation burden they relieve is one of the genuine wins of clinical AI so far. The argument is narrower: a vendor's internal safety layer answers the question the vendor scoped, and that question is not the question your patients' safety turns on.
So before the next ambient AI contract, two things are worth pinning down. First, what reference does the vendor's safety check actually use, the transcript or external clinical truth? Second, who graded the accuracy number, and would it survive someone outside the company running the test against cases the company did not pick?
If the answer to the first is "the transcript" and the answer to the second is "we did," that is not a reason to walk away. It is a reason to know precisely where the vendor's guarantee stops, so that you can decide what sits in the gap between transcript-faithful and clinically correct, and who is accountable for it.
That gap is not a vendor problem to be embarrassed about. It is a governance question for the health system to answer on purpose.
Further reading
- Abridge, The Science of Confabulation Elimination (August 2025). The whitepaper discussed here. Pages 9 and 27 are the ones to read: the transcript-as-reference standard, and the internal benchmark.
- Grolleau F, et al. Physician-Reported Safety Outcomes of AI-Generated Hospital Course Summaries. JAMA Network Open (2026). A Stanford pilot that rated AI summaries broadly safe, and whose authors state they plan to build a system to evaluate vendor AI tools. Independent verification of vendors, proposed by the people who ran the study.
- Bell SK, et al. Frequency and Types of Patient-Reported Errors in Electronic Health Record Ambulatory Care Notes. JAMA Network Open (2020). Roughly 1 in 5 patients found a mistake in their notes, and 42% of those rated it serious. Documentation error predates AI.
- Shahbodaghi A, et al. Documentation Errors and Deficiencies in Medical Records: A Systematic Review. Journal of Health Management (2024). Errors present in 47 of 48 studies reviewed, with a call for automatic error-detection systems.