Direct GPT-5.5 told us Nature published 878 articles in 2020. Direct Claude Opus 4.7 said approximately 900. Both were wrong. The actual count is 1,002 — and only one of three AI systems we tested bothered to look it up.
This post is about why that matters more than it sounds.
The question
A real GAIA Level 2 validation question (task ID 04a04a9b-226c-43fd-b319-d5e89743676f):
If we assume all articles published by Nature in 2020 (articles, only, not book reviews/columns, etc.) relied on statistical significance to justify their findings and they on average came to a p-value of 0.04, how many papers would be incorrect as to their claims of statistical significance? Round the value up to the next integer.
The gold answer is 41. The path to it has three steps:
- Find the actual count of Article-type publications in Nature in 2020 (the answer is 1,002).
- Multiply by 0.04, treating the assumed average p-value as the false-positive rate: 1,002 × 0.04 = 40.08.
- Round up: 41.
Steps 2 and 3 are arithmetic. Step 1 is the entire test. GAIA is designed precisely to expose whether a system retrieves real information from the open web or invents a plausible number from its training data.
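For completeness, here is the arithmetic of steps 2 and 3 as a few lines of Python, assuming the retrieved count of 1,002 from step 1:

```python
import math

article_count = 1_002        # step 1: retrieved from Nature's archive
false_positive_rate = 0.04   # step 2: the question's assumed average p-value

expected_false_positives = article_count * false_positive_rate  # 40.08
answer = math.ceil(expected_false_positives)                    # step 3: round up

print(answer)  # 41
```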
The setup
Same question, verbatim, to three systems. Default settings, no prompt engineering.
| System | Final answer | Correct | Wall clock |
|---|---|---|---|
| SPARKIT (one POST to /v1/research) | 41 | ✓ | 94s |
| Direct Claude Opus 4.7 (/v1/messages, high effort) | 36 | ✗ | 8s |
| Direct GPT-5.5 (/v1/chat/completions, default reasoning) | 36 | ✗ | 95s |
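For reference, the two direct calls were plain requests to the providers' standard endpoints. A minimal sketch of what "default settings, no prompt engineering" amounts to is below; the model identifier strings are assumptions, and the reasoning-effort settings noted in the table are omitted here and set per each provider's documentation. The SPARKIT call is a single POST to /v1/research (a hypothetical sketch appears under "Where to go next").

```python
# Minimal sketch of the two direct LLM calls. Model ID strings are assumptions;
# substitute whatever identifiers your accounts expose.
import os
import requests

QUESTION = "If we assume all articles published by Nature in 2020 ..."  # the verbatim GAIA question

# Anthropic Messages API (POST /v1/messages)
anthropic_resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4.7",  # assumed model ID
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": QUESTION}],
    },
)

# OpenAI Chat Completions API (POST /v1/chat/completions)
openai_resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-5.5",  # assumed model ID
        "messages": [{"role": "user", "content": QUESTION}],
    },
)
```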
Two findings sit on top of this table.
Finding 1: both direct LLMs invented Nature's article count
Direct Opus 4.7's reasoning:
"Nature published approximately 900 research articles in 2020 (excluding book reviews, columns, editorials, news, etc.)."
Where did the 900 come from? Nowhere. Opus 4.7 made it up. The actual count is 1,002 — off by 10%.
Direct GPT-5.5's reasoning:
"Using an article-only count of 878 Nature articles in 2020: 878 × 0.04 = 35.12 → ⌈35.12⌉ = 36"
GPT-5.5 made up a different number — 878. Off by 12%.
By coincidence, 878 and 900 both round up to the same final answer once you do the p-value math: 36. Two different hallucinated counts. One identical wrong answer. Both presented with the same confident prose, the same tidy formatting, and no caveat that the underlying number was fabricated.
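The coincidence is easy to check:

```python
import math

for hallucinated_count in (878, 900):
    print(hallucinated_count, "->", math.ceil(hallucinated_count * 0.04))
# 878 -> 36
# 900 -> 36
```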
If you read either response and were not familiar with Nature's 2020 publication volume, you would have no signal that the answer is wrong.
Finding 2: speed without retrieval is just fast wrongness
GPT-5.5 took 95 seconds to produce a wrong answer. SPARKIT took 94 seconds to produce the right one. Direct Opus 4.7 was much faster — 8 seconds — but the speed bought it nothing except a head start on confidently misinforming you.
When people talk about wanting "fast AI for research," they usually mean fast correct answers. A fast confident hallucination is worse than a slow correct retrieval, because it is harder to catch.
What SPARKIT actually did
SPARKIT's response shows its work. The relevant excerpt from the report:
I queried Nature's own archive, filtered to document type = Article and year = 2020:
https://www.nature.com/nature/articles?type=article&year=2020
The pagination block ends at page 51. By scraping each page and counting the article list entries:
| Page | # of articles |
|---|---|
| 1 | 20 |
| 25 | 20 |
| 50 | 20 |
| 51 | 2 |
So pages 1–50 each contain 20 articles, and page 51 contains 2:
N = 50 × 20 + 2 = 1,002 articles
1,002 × 0.04 = 40.08
⌈40.08⌉ = 41
You can verify this yourself. Open that URL. Click through the pagination. Count. The work is reproducible end-to-end, which is the entire point — when SPARKIT says 1,002, you can audit it. When Opus 4.7 says "approximately 900," there is nothing to audit because there is no source.
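If you would rather not click through 51 pages by hand, the count can be reproduced with a short script. This is a sketch under assumptions: the `page` query parameter and the idea that each listed paper is one `<article>` element are guesses about nature.com's current markup, which can change, so verify the selector in your browser's inspector first.

```python
# Walk the Nature 2020 Article archive's pagination and count entries.
# The `page` parameter and per-entry <article> tag are assumptions about
# nature.com's markup; verify before relying on this.
import math
import requests
from bs4 import BeautifulSoup

BASE = "https://www.nature.com/nature/articles"
PARAMS = {"type": "article", "year": "2020"}
HEADERS = {"User-Agent": "Mozilla/5.0 (verification script)"}

total = 0
for page in range(1, 60):  # the archive ends at page 51; 60 is a safety cap
    resp = requests.get(BASE, params={**PARAMS, "page": page}, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    entries = BeautifulSoup(resp.text, "html.parser").find_all("article")
    if not entries:        # walked past the last page
        break
    total += len(entries)

print(total)                            # expected: 1002
print(math.ceil(total * 0.04))          # expected: 41
```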
Why this matters
The failure mode the two direct LLMs demonstrated above is the single most common reason "AI for research" does not work in practice for anyone who needs accurate, retrievable facts:
- A meta-scientist writing about replication rates and false-discovery rates in Nature's 2020 output.
- A journalist citing the volume of high-impact publications in a calendar year.
- A scientometrician computing per-journal indices.
- A grant officer estimating field activity for resource allocation.
- A reviewer or editor sanity-checking a manuscript's "Nature published X papers in 2020" footnote.
In every one of those use cases, the difference between "approximately 900" and 1,002 is consequential. And in every one of those use cases, the user has no in-band signal that "approximately 900" was hallucinated. Both direct LLMs answered with the same surface confidence as SPARKIT — clean prose, polished formatting, no flags about retrieval. The only way to tell one answer from another was to actually go check.
This is not a quirk. It is a structural property of how single-LLM systems answer questions whose answers live outside the model's weights. The model has a vague prior over how many articles Nature publishes per year, and when forced to give a number, it samples from that prior. The output is plausible. It is also, in this case, wrong by 10–12% — and confidently presented as fact.
The agentic difference
Direct LLMs answer from training data. They cannot fetch a URL, parse pagination, count list entries, or iterate on an intermediate result. If the answer requires going somewhere and looking, they invent.
An agentic research system can do all of those things. SPARKIT's 94 seconds went into actually visiting the archive, walking the pagination, counting, and computing. The wall clock is longer than Opus 4.7's because real retrieval takes time. That tradeoff is the entire point.
If you are building anything where "made up the number" is an unacceptable failure mode — and for most research, journalism, and decision-support workloads, it is — you want the agent. The single-LLM call is a tool. It is not a research tool.
Methodology
- Question: GAIA validation set, task ID 04a04a9b-226c-43fd-b319-d5e89743676f, level 2.
- Date: April 2026.
- Each system was given the question once, verbatim, in a fresh context, with the settings noted in the table. No retries. No prompt engineering. No external tools wired in for the direct LLM calls — they had only their native capabilities.
- Wall clock is end-to-end, from request submission to final answer.
- The gold answer of 41 is from the GAIA validation set; the underlying article count of 1,002 is independently verifiable at https://www.nature.com/nature/articles?type=article&year=2020.
- This is one question. A single data point is not a benchmark. We picked this question because the failure mode it exposes is representative of a category of failure that direct LLMs hit reliably whenever an answer requires retrieval rather than recall. If you want the multi-question version, HLE-Gold has 149 of them.
Where to go next
- Try a verifiable question of your own — the playground in the dashboard is the same API. Pick something whose answer is on a public website and see what each tool does.
- Start with Try-it — $10 for 5 queries, no subscription required.
- Ship it into your agent — pip install sparkit-science, mint a key at app.sparkit.science/keys, one POST. Full reference in the API docs; a hypothetical sketch of that POST follows this list.
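A hypothetical sketch of the "one POST" integration using raw HTTP. The base URL, auth scheme, and payload/response field names are all assumptions on my part; the API docs are authoritative.

```python
# Hypothetical integration sketch: base URL, auth header, and field names are assumptions.
import os
import requests

resp = requests.post(
    "https://api.sparkit.science/v1/research",  # assumed base URL for the /v1/research endpoint
    headers={"Authorization": f"Bearer {os.environ['SPARKIT_API_KEY']}"},  # assumed auth scheme
    json={"query": "How many Article-type papers did Nature publish in 2020?"},  # assumed field name
    timeout=600,  # agentic retrieval takes real wall-clock time (94s in the run above)
)
resp.raise_for_status()
print(resp.json())
```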
If you find a question where SPARKIT invented something it should have looked up, tell us — that is exactly the bug we want to hear about.