White Paper

Invisible Errors: How to Know When an LLM Gets It Wrong

A hallucinated transition probability reads identically to an evidence-based one. This white paper argues that responsible LLM use in HEOR requires structural workflow architecture, not post-hoc review.


Executive Summary

Health economic and outcomes research sits at the intersection of evidentiary rigor and consequential decision-making, where a single misattributed parameter can distort a pricing recommendation, compromise an HTA submission, or undermine a formulary decision. The adoption of large language models in HEOR workflows is accelerating precisely because the field's workload demands it: model specifications, systematic literature reviews, and dossier preparation are time-intensive, and the promise of LLM-assisted generation is genuinely compelling. What this white paper argues, with evidence, is that the promise is structurally conditional in ways that current practice and current guidance have not adequately confronted.

The foundational problem is not that LLMs make errors. It is that LLM errors in HEOR are invisible in output format. A hallucinated transition probability is formatted identically to an evidence-based one. A fabricated cost parameter carries the same decimal precision and citation structure as a correctly extracted value. Published research demonstrates that GPT-4o fabricated approximately 19% of citations in specialized domain testing: a rate that implies roughly eight potentially invented entries in a forty-parameter evidence table, none of which will signal their own inaccuracy. ISPOR has confirmed that hallucinations in this domain are "very plausible and hard to detect." This is not a problem that careful prompting resolves. It is a structural property of how these models generate text, and it places the entire burden of detection on the reviewer: requiring not a targeted audit of anomalies, but exhaustive return-to-source verification of every extracted value.

That verification burden eliminates the productivity gain. When a model specification that would take three weeks to draft takes three days with LLM assistance, the acceleration is real. But if every parameter in that specification requires independent source verification before it can be trusted in a submission context, the time saved during generation is consumed during review. Effort has not been reduced; it has been redistributed from creation to verification, against an output designed to look correct regardless of accuracy. In Markov models, where an incorrect transition probability propagates through every time cycle across a multi-year horizon, detecting an error at the output stage means reconstructing the evidentiary chain from the beginning. That is not quality control. That is reconstruction.

Current guidance from NICE, ELEVATE-GenAI, and Canada's Drug Agency establishes a necessary baseline: requiring disclosure of AI use, reproducibility of results, and human oversight. These are the right requirements. What they do not specify is the engineering layer that makes those requirements achievable: automated validation at the point of parameter extraction, parameter-level traceability linking each value to its source publication, and task decomposition that isolates error-prone stages before outputs propagate downstream. Transparency about which LLM generated a document does not help a reviewer identify which values are wrong. It documents that the problem exists without providing the mechanism to address it.

The strategic implication for HEOR teams and decision-makers evaluating LLM integration is direct: productivity gains from LLM-assisted workflows are achievable, but only when generation and validation are structurally integrated rather than sequentially separated. Organizations that deploy LLMs at the generation stage without investing in that validation architecture are not accelerating their workflows. They are accelerating the creation of documents that require the same work to be done twice.


The Invisibility Problem: Why LLM Errors in HEOR Are Structurally Undetectable from Outputs Alone

The most consequential characteristic of LLM errors in health economic work is not their frequency. It is their appearance. A hallucinated transition probability reads identically to an evidence-based one. A fabricated cost parameter presents itself with the same formatting, decimal precision, and source citation as a correctly extracted value. A utility weight misattributed to the wrong health state sits in a parameter table with exactly the same structural confidence as every other entry.

This is not a limitation that careful prompting resolves. It is a property of how LLMs generate outputs. The model produces fluent, plausible, well-formatted text regardless of whether the underlying claim is accurate. In HEOR, where the currency of the entire workflow is precise numerical values drawn from specific evidence sources, this creates a verification problem that is qualitatively different from anything the field has previously managed.

What Makes HEOR Particularly Exposed

Human errors in health economic modeling are typically detectable. A miskeyed value breaks a formula. An inconsistency between a parameter table and its model implementation produces an implausible result. A missing citation prompts a reviewer to ask where the value came from. Traditional errors leave traces.

LLM errors do not. ISPOR sources confirm that hallucinations in this domain are "very plausible and hard to detect," potentially making LLMs "risky or unreliable for specific tasks" [2]. A 2025 primer on generative AI in HEOR notes that hallucinations are especially prevalent in niche domains requiring complex, specialized reasoning, precisely the kind of inference health economic modeling demands [4]. Roche Diagnostics flags "hallucinations in the form of fake citations" as an active risk in LLM-assisted systematic literature reviews [1].

The scale is illustrated by published citation verification research: in specialized domains, GPT-4o fabricated approximately 19% of citations in one experimental study [3]. In a parameter table with 40 evidence-sourced values, that rate implies roughly eight potentially invented entries, none of which will announce themselves as fabricated.

The Burden of Proof Shifts Entirely to the Reviewer

| Dimension | Traditional Human Error | LLM Hallucination |
| --- | --- | --- |
| Visibility in output | Often leaves inconsistency traces | Indistinguishable from correct output |
| Detection method | Logical cross-check or formula audit | Requires return to source publication |
| Reviewer starting assumption | "Flag anomalies" | "Verify everything" |
| Scalability of review | Practical for most HEOR outputs | Not scalable for complex model specifications |

This asymmetry is the structural problem. When an LLM generates a 200-page model specification with parameter tables and evidence citations, the reviewer cannot apply the normal heuristic of flagging what looks wrong. Everything looks right. The only reliable verification strategy is exhaustive: return to every cited source, confirm every extracted value, verify every attribution.

That exhaustive review consumes the time that LLM generation saved. Effort has not been reduced; it has been redistributed from creation to review, with no net productivity gain. In HEOR, where model specifications include dozens of interdependent parameters, cascading health state transitions, and sensitivity analyses spanning hundreds of variable combinations, this burden is neither practical nor scalable.

The question is not whether LLMs make errors in HEOR work. They do [1][2][3]. The question is whether the workflow is designed to find those errors before they propagate into a pricing decision, an HTA submission, or a formulary recommendation. Post-hoc review, no matter how rigorous, is not that design.


The Guidance Gap: What NICE, ISPOR, and Current Frameworks Get Right and What They Miss

If LLM errors in HEOR are structurally invisible in outputs, the natural question is whether existing guidance frameworks address the problem. They do not, at least not at the level where it matters.

The short answer: they acknowledge the risk, then prescribe transparency measures that do not address it.

What the Frameworks Get Right

NICE's 2024 position statement is the most substantive HTA-specific guidance published to date. Its requirements are sensible: full disclosure of AI use, reproducibility of results, appropriate human oversight, data protection compliance, and a clear mandate that AI-generated evidence must meet the same quality standards as traditional methods [5]. ISPOR's ELEVATE-GenAI framework extends this with a structured reporting checklist spanning ten domains, including accuracy, reproducibility, and fairness [3][6]. Canada's Drug Agency adapted the NICE principles in April 2025, reinforcing the same transparency-first approach [6].

These are not trivial contributions. Before NICE issued its position, submitters had no formal signal on whether AI-assisted dossiers would be evaluated differently. ELEVATE-GenAI gives HEOR teams a structured vocabulary for documenting LLM involvement.

The problem is what they do not specify.

The Structural Mismatch

| Requirement | Current Guidance Coverage | What Is Missing |
| --- | --- | --- |
| Declare AI use and version | Mandated (NICE, ELEVATE-GenAI) | Not missing |
| Reproducibility of outputs | Required (NICE) | Mechanism for achieving this unspecified |
| Human oversight | Required (NICE, ELEVATE-GenAI) | Positioned at review, not generation stage |
| Parameter-level source verification | Not addressed | No published guidance mandates this |
| Automated validation at generation | Not addressed | No published guidance mandates this |
| Task decomposition with checkpoints | Not addressed | No published guidance mandates this |

The Adoption Gap Compounds the Problem

As of early 2025, of the 22 or more HTA agencies assessed, only NICE and Canada's Drug Agency have issued formal AI guidance [7]. PBAC and IQWiG have not published AI-specific frameworks. Where guidance does exist, it is structurally oriented toward reporting rather than prevention.

Here is the core mismatch: current guidance assumes that requiring transparency and human oversight provides sufficient quality assurance. But if hallucinated parameter values are indistinguishable from valid ones in output format, knowing which LLM generated a document does not help a reviewer identify which values are wrong. Transparency about the tool used does not solve the verification problem. It documents that the problem exists.

NICE's requirement that AI-generated evidence meet "the same quality standards as traditional methods" is the right standard [5]. It does not specify how that standard is achievable given hallucination invisibility. Neither ELEVATE-GenAI nor any other published framework mandates automated validation at the point of parameter extraction, parameter-level traceability linking each value to its source publication, or task decomposition that isolates error-prone stages before outputs propagate downstream.

LLM adoption in HEOR is advancing faster than guidance frameworks can evaluate. Consultancies are using LLMs for evidence synthesis, model specification, and dossier preparation, while most major HTA bodies remain in a monitoring phase [5][7]. Submitters are deploying LLM-assisted workflows at scale while frameworks for evaluating those workflows remain focused on procedural disclosure rather than structural validation.

I read the current guidance as a necessary first step. The guidance bodies identified the right concerns and established a baseline. What they have not yet required is the engineering layer that makes those concerns addressable in practice. That layer is where the work actually needs to happen.


The Redistribution Fallacy: Why LLMs Without Structural Safeguards Will Not Create Productivity Gains in HEOR

The practical consequence of invisible errors and insufficient guidance is straightforward: without structural mechanisms to catch hallucinated parameters, LLMs do not deliver productivity gains in HEOR. They redistribute effort.

The distinction matters. Redistribution feels like productivity because the creation phase accelerates. A model specification that would take three weeks to draft takes three days with LLM assistance. The error is assuming that acceleration is the gain. If the output requires exhaustive manual verification before it can be trusted, the time saved during generation is consumed during review. Net improvement: zero. And the reviewer is now working against an output designed to look correct regardless of accuracy, which is harder than working from a blank page.

The 73% Problem

The most cited data point on LLM performance in health economic modeling comes from Reason et al. (2024): 73% of automatically generated cost-effectiveness models were completely error-free [8][9]. The framing is optimistic. It should not be.

A 27% error rate in model code is manageable because code errors are detectable. A unit conversion error breaks an R script; an incorrect formula produces an implausible result. These errors surface during technical quality control and require, on average, under ten minutes to correct [8]. The error type is self-announcing.

Parameter extraction errors do not self-announce. A fabricated transition probability, an invented cost value, or a utility weight misattributed to the wrong health state occupies the same cell with the same formatting and the same source citation as every correct entry around it. The reviewer has no signal. The only reliable detection method is returning to the source publication and verifying the value directly.
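The contrast is easy to make concrete. In the sketch below (a minimal illustration with entirely hypothetical values), the structural error halts execution on first run, while the fabricated value flows through every downstream calculation without complaint.

```python
# Hypothetical per-cycle costs; the point is the error behavior,
# not the numbers.
cost_per_cycle = {"stable": 1200.0, "progressed": 5400.0}
cycles = 12

# Self-announcing error: a miskeyed value of the wrong type raises
# a TypeError the first time the model runs.
# bad_total = cost_per_cycle["stable"] * "12"  # TypeError

# Silent error: a hallucinated but plausible cost is just a number.
cost_per_cycle["progressed"] = 4800.0  # fabricated; nothing breaks

total = cycles * sum(cost_per_cycle.values())
print(f"Total cost: {total:,.2f}")  # confident, well-formatted, wrong
```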

In a model specification with 60 evidence-sourced parameters, applying that standard to every entry is not a quality check. It is a full reconstruction of the evidence review process. The LLM did not save time. It created a document that requires the time to be spent again.

The Compounding Effect

| Error Stage | Where Error Enters | Downstream Impact |
| --- | --- | --- |
| Parameter extraction | Transition probability, cost, utility value | Propagates through all model calculations |
| Model specification | Structural assumption or health state definition | Invalidates multiple downstream parameters |
| Evidence attribution | Source citation | Reviewer cannot verify correctness without source access |
| Sensitivity analysis | Incorrect base-case value used as reference | All scenario outputs anchored to wrong value |

In Markov modeling, an incorrect transition probability extracted at the specification stage propagates through every time cycle, inflating or deflating cumulative costs and QALYs across a multi-year horizon. Detecting it at the output stage requires tracing backward through the entire chain. That is not review. That is reconstruction.
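The compounding is easy to demonstrate. Below is a deliberately minimal three-state cohort model; every probability and utility in it is a hypothetical placeholder, and the only difference between the two runs is a single transition probability of exactly the kind a reviewer cannot distinguish from an evidence-based value on inspection.

```python
import numpy as np

def total_qalys(p_progress: float, cycles: int = 40) -> float:
    # Annual transition matrix for states (Stable, Progressed, Dead);
    # each row sums to 1. All values are illustrative placeholders.
    transition = np.array([
        [1 - p_progress - 0.02, p_progress, 0.02],  # from Stable
        [0.00,                  0.90,       0.10],  # from Progressed
        [0.00,                  0.00,       1.00],  # from Dead
    ])
    utilities = np.array([0.85, 0.60, 0.00])  # QALY weight per state
    cohort = np.array([1.0, 0.0, 0.0])        # everyone starts Stable
    qalys = 0.0
    for _ in range(cycles):
        cohort = cohort @ transition   # one time cycle
        qalys += cohort @ utilities    # accumulate cohort QALYs
    return qalys

evidence_based = total_qalys(0.10)  # value verified against the source
fabricated = total_qalys(0.08)      # plausible-looking hallucination

print(f"QALYs, verified p:   {evidence_based:.2f}")
print(f"QALYs, fabricated p: {fabricated:.2f}")
print(f"Divergence:          {fabricated - evidence_based:+.2f}")
```

Both runs produce equally fluent, well-formatted output; nothing in the results indicates which transition probability was fabricated.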

This is why the productivity argument for LLMs in HEOR, as currently practiced, fails structurally. The gains are real but conditional: conditional on the absence of undetected parameter errors, which cannot be assumed given what we know about how LLMs generate outputs in specialized domains [1][4].

The right question is not whether LLM-assisted HEOR work is faster at the generation stage. It is whether total workflow time, including verification, is shorter, and whether output confidence is maintained or degraded. Without structural safeguards, the answer to both is unfavorable. Productivity is not created by accelerating generation. It is created by designing workflows where generation and validation are integrated, so that review confirms what has already been verified rather than detecting what was never checked.


The Structural Solution: Four Architectural Prerequisites for Responsible LLM Use in HEOR

If the problem is structural, so is the solution. What responsible LLM deployment in HEOR requires is not better policies alone, but an engineering specification for how validation gets built into the workflow.

There are four mechanisms. They are not review steps appended to an existing workflow. They are architectural prerequisites that must be present before LLM generation begins.

The Four Prerequisites

1. Automated validation at the point of generation. Parameter extraction must be verified against source publications at the moment of generation, not during dossier review weeks later. The SourceCheckup framework demonstrates an automated agent-based pipeline that validates whether LLM-generated medical claims accurately reflect their cited sources, achieving high agreement with licensed medical experts [10]. The critical design principle is timing: validation embedded at the extraction stage prevents fabricated values from entering model specifications at all. Post-hoc validation requires working backward through an already-propagated error chain. (A minimal sketch of such a gate, placed at an explicit stage boundary, follows this list.)

2. Parameter-level traceability. Every parameter, assumption, and structural decision must carry documented provenance: the source publication, extraction method, page reference, and alternative values identified in the literature. Knowledge-grounded frameworks in clinical decision support confirm the feasibility of this architecture [12]. In HEOR, traceability transforms the reviewer's task from "does this look right?" to "verify the documented chain." These are not equivalent cognitive tasks, and only one of them scales.

3. Task decomposition into discrete verifiable stages. End-to-end LLM generation, from clinical concept to submission-ready model specification, is the highest-risk configuration because errors at early stages propagate invisibly through every subsequent one. The LATCH framework demonstrates the governing principle: LLMs handle semantic reasoning while deterministic systems execute statistical operations, preventing error propagation across the boundary [11]. Applied to HEOR, the decomposition boundary sits between evidence synthesis, where LLMs add genuine value, and parameter embedding, where validation must be structural.

4. Human decision points at judgment-critical junctures. This prerequisite is frequently misapplied. Current workflows position human review at the end, scanning a finished model for anything that looks wrong. The correct position is at specific, identifiable decision points: choosing between conflicting evidence sources, deciding on structural assumptions, interpreting ambiguous clinical endpoints. Conflating judgment with verification produces exhaustive review burdens that still miss invisible errors.
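To ground prerequisites 1 through 3, here is a minimal sketch of a decomposed workflow with a validation gate at the extraction boundary. Every name, field, and check in it is an illustrative assumption, not drawn from SourceCheckup, LATCH, or any published tool; a production gate would use fuzzy matching, unit normalization, or a verifier model rather than exact string containment.

```python
from dataclasses import dataclass

@dataclass
class ExtractedParameter:
    name: str             # e.g. "p_stable_to_progressed"
    value: float
    source_id: str        # identifier of the cited publication
    page_ref: str         # where in the source the value appears
    quoted_context: str   # passage the LLM claims it extracted from

def validation_gate(param: ExtractedParameter, source_text: str) -> bool:
    """Prerequisite 1: accept a parameter only if its claimed context
    and value can be located in the actual source document."""
    return (param.quoted_context in source_text
            and str(param.value) in source_text)

def build_specification(extracted: list[ExtractedParameter],
                        sources: dict[str, str]):
    """Prerequisite 3: a stage boundary. Values cross from evidence
    synthesis into the specification only through the gate; rejects
    are routed to human review instead of propagating downstream."""
    accepted, rejected = [], []
    for param in extracted:
        if validation_gate(param, sources[param.source_id]):
            accepted.append(param)  # provenance travels with the value
        else:
            rejected.append(param)
    return accepted, rejected
```

Accepted parameters retain their provenance fields (prerequisite 2), so downstream review verifies a documented chain rather than starting from scratch.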

| Prerequisite | Where It Operates | What It Prevents |
| --- | --- | --- |
| Automated source validation | Point of extraction | Fabricated parameters entering model specification |
| Parameter-level traceability | Embedded in output structure | Exhaustive unguided verification burden |
| Task decomposition | Between workflow stages | Error propagation across the modeling chain |
| Human decision points | Judgment-critical junctures only | Misuse of human expertise on automatable tasks |

The distinction across all four is consistent: these mechanisms operate during generation, not after it. Post-hoc review can confirm what structural safeguards have already verified; it cannot replace them.

Published research confirms the technical feasibility of each component [10][11][12]. What has not been implemented is their integration into an operational HEOR workflow with validation at each stage boundary. That implementation gap is where responsible LLM deployment in HEOR currently fails.


Implementation: What HEOR Teams, HTA Bodies, and Guideline Developers Should Do Now

The architectural prerequisites are clear. The question now is who acts on them, and when. The answer differs by stakeholder, but the urgency does not.

For HEOR Teams Currently Using or Evaluating LLMs

The first action is diagnostic. Before expanding LLM use to additional workflow stages, audit existing LLM-assisted outputs for traceability gaps. For each parameter in a recent model specification, trace the value back to its cited source and verify it was correctly extracted. Document every discrepancy. This exercise will reveal whether your current workflow is producing invisible errors at a rate that matters. It is not a permanent process; it is a calibration.
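A minimal tally for that calibration might look like the sketch below; the record structure, values, and parameter names are hypothetical, not a prescribed standard.

```python
# Each audit entry: (parameter name, value in the LLM output,
# value verified directly against the cited source).
audit = [
    ("p_response",   0.42, 0.42),
    ("cost_admin",  310.0, 310.0),
    ("u_progressed", 0.61, 0.68),  # discrepancy found at the source
]

discrepancies = [entry for entry in audit if entry[1] != entry[2]]
print(f"Discrepancy rate: {len(discrepancies) / len(audit):.0%}")
for name, output_value, source_value in discrepancies:
    print(f"  {name}: output={output_value}, source={source_value}")
```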

The second action is structural. Redesign the workflow around verifiable stages rather than end-to-end generation. The decomposition that matters most is the boundary between evidence synthesis and parameter embedding: LLMs can accelerate literature review and evidence mapping; parameter extraction requires automated source verification before values enter a specification. Implement that boundary explicitly. Do not treat it as a review step. Treat it as a workflow gate.

The third action is documentation. For every parameter in a model specification, document its provenance: source publication, table or page reference, extraction method, and any alternative values identified. This does not require new technology. It requires a structured template applied consistently. When documentation is in place, reviewers verify a documented chain rather than performing exhaustive re-verification from scratch. The difference in reviewer time is substantial.
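As an illustration, the provenance record for a single parameter could be as simple as the following; the field names are suggestions rather than a mandated schema, and all values are placeholders.

```python
# One provenance record per parameter in the specification.
provenance_record = {
    "parameter": "u_progressed",
    "value": 0.61,
    "source": "Example et al. 2023 (hypothetical citation)",
    "location": "Table 3, p. 1147",
    "extraction_method": "LLM-assisted, verified against source PDF",
    "alternatives": [
        {"value": 0.58, "source": "another identified publication"},
    ],
}
```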

| Action | Who Does It | When | What It Prevents |
| --- | --- | --- | --- |
| Traceability audit of existing outputs | HEOR team lead | Before next submission | Hidden parameter errors in current dossiers |
| Source verification gate at extraction stage | Workflow architect | Before expansion of LLM use | Fabricated values propagating into specifications |
| Structured provenance documentation | All modelers | Immediately | Exhaustive unguided review burden |
| Human decision points at assumption junctures | Senior methodologist | At specification sign-off | Structural errors invisible to automated checks |

For HTA Bodies and Guideline Developers

NICE's requirement that AI-generated evidence meet "the same quality standards as traditional methods" is the correct standard [5]. The problem is that it does not specify how that standard is achievable when hallucinated parameters are indistinguishable from valid ones in output format. Future guidance should close that gap.

Specifically: guidance should mandate parameter-level traceability as a submission requirement. If a model specification includes extracted parameter values, each value should carry documented provenance. This is not an unreasonable burden on submitters; it is a documentation standard that responsible workflows should already be meeting. ISPOR's ELEVATE-GenAI framework establishes reporting principles [3]; the next version should extend those principles to specify what evidence of structural validation is required, not just what should be disclosed.

HTA bodies should also distinguish between tasks where LLM autonomy is appropriate and tasks requiring structural validation. Literature synthesis and evidence mapping are lower-risk applications. Parameter extraction and evidence attribution are higher-risk. Guidance that treats LLM use as a single category misses the task-level variation in error probability.

The standard for evaluating LLM-assisted submissions should not be whether the output passes review. It should be whether the workflow that produced it was designed to catch errors that reviewers cannot see. That is the question HTA bodies should require submitters to answer.


Conclusion: The Question Is Not Whether LLMs Make Errors. It Is Whether Your Workflow Finds Them.

The argument is singular and structural. LLM errors in HEOR are invisible in output format. Current guidance frameworks acknowledge the risk but do not mandate the engineering mechanisms needed to address it. Without those mechanisms, LLMs redistribute effort rather than reduce it. And four architectural prerequisites exist that would change this, all of which are technically feasible and none of which are currently standard practice. What that means in practice is direct.

The Window of Risk Is Open Now

LLM adoption in pharmaceutical HEOR is not a future scenario. Consultancies are using LLMs for evidence synthesis, model specification, and dossier preparation today. As of early 2025, of the more than 22 HTA bodies assessed, only NICE and Canada's Drug Agency have issued formal AI guidance [7]. PBAC and IQWiG have not published AI-specific frameworks. ISPOR's ELEVATE-GenAI establishes reporting standards but does not mandate structural validation at the point of generation [3].

This asymmetry, between adoption pace and governance maturity, represents a period of genuine risk. Outputs informed by LLM-assisted workflows are reaching decision points: pricing analyses, HTA submissions, formulary recommendations. The workflows producing them are not, in most cases, designed to detect the errors that are invisible in output format. That is not a hypothetical concern. It is the current state.

The Architecture Is the Argument

I want to be precise about what this paper is not arguing. It is not arguing that LLMs cannot be used responsibly in HEOR. It is not arguing that human workflows are error-free. Published baselines show that 63% of human-extracted data records contain at least one error. The argument is not human versus AI.

The argument is structural: responsible LLM use in HEOR requires deliberate workflow architecture, not procedural oversight appended after generation. The four prerequisites (automated validation at the point of extraction, parameter-level traceability, task decomposition into verifiable stages, and human decision points positioned at judgment-critical junctures) are not enhancements to existing workflows. They are the conditions under which LLM-assisted HEOR work can be trusted.

| Dimension | Without Structural Safeguards | With Structural Safeguards |
| --- | --- | --- |
| Productivity | Effort redistributed from creation to verification | Net productivity gain achieved |
| Review quality | Reviewer works against output designed to look correct | Reviewer verifies documented provenance chain |
| Error propagation | Errors propagate invisibly across modeling stages | Errors intercepted at stage boundaries |
| Human expertise | Human expertise consumed by exhaustive checking | Human expertise applied at judgment-critical junctures |

What the Field Should Demand

The standard for LLM-assisted HEOR work should not be whether the output passes review. Reviewers cannot reliably detect hallucinated parameters from a finished document, regardless of how carefully they look [1][2]. The standard should be whether the workflow that produced the output was designed to catch errors that reviewers cannot see.

That is the question HEOR teams should apply to their own processes before the next submission. It is the question HTA bodies should require submitters to answer. And it is the question guideline developers should build into the next iteration of frameworks that currently stop at transparency and disclosure.

The productivity promise of LLMs in HEOR is real. It is also conditional. The condition is structural: build the validation architecture first. The question was never whether LLMs make errors. They do [1][3][10]. The question is whether your workflow finds them.


References

  1. How AI is making HEOR even faster - Roche Diagnostics. diagnostics.roche.com
  2. Using Generative Artificial Intelligence in Health Economics and Outcomes Research. pmc.ncbi.nlm.nih.gov
  3. ELEVATE-GenAI: Reporting Guidelines for the Use of Large Language Models. sciencedirect.com
  4. Using Generative Artificial Intelligence in Health Economics and Outcomes Research. springer.com
  5. Generative AI for Health Technology Assessment - PMC. pmc.ncbi.nlm.nih.gov
  6. ELEVATE-GenAI: Reporting Guidelines - ISPOR. ispor.org
  7. Acceptance of Artificial Intelligence in Evidence and Dossier Developments by HTA Bodies. inizio.com
  8. Artificial Intelligence to Automate Health Economic Modelling - PMC. pmc.ncbi.nlm.nih.gov
  9. Generative Artificial Intelligence and the Future of Health Economic Modeling - ISPOR. ispor.org
  10. An automated framework for assessing how well LLMs cite sources. nature.com
  11. An LLM-assisted framework for accelerated and verifiable clinical research. medrxiv.org
  12. A Hybrid Knowledge-Grounded Framework for Safety. arxiv.org


Portions of this document were drafted with the assistance of an LLM. Developed by Aide Solutions LLC.