May 16, 2026

How to think about the deep-research tool landscape

Deep research went from a niche capability to a feature on every major chat platform in under a year. Perplexity, ChatGPT, Claude, Gemini, Elicit, You.com, Consensus, SPARKIT: most platforms now ship some flavor of "ask a hard question, the system goes off and pulls together a multi-source answer." If you are choosing which tool to use, the question that matters is no longer whether the tool gets the answer right; that is now table stakes for any question with a literature-grounded answer. What matters is how the tool fits your workflow.

This post is the map. For a tactical head-to-head — same question sent to five tools — see Equip your agent with deep research in one line of code.

What every deep-research tool has in common

Pull back from the marketing and the surface differences look small.

Multi-step retrieval. A question goes in, the system runs multiple searches, reads documents (often PDFs), reasons over what it found, and iterates. No credible deep-research product is a single LLM call.
Citations. Every serious product surfaces sources alongside the answer. The form varies — inline URLs, footnotes, a sidebar — but the principle is identical.
Time/quality tradeoff. Every tool spends more wall-clock time than a one-turn chat, anywhere from ~10 seconds (Perplexity) to 15 minutes (Gemini Deep Research). Depth scales with the wait.
Safety screening. All major tools refuse dual-use research, weapons synthesis, and the obvious harm categories, with varying calibration.
Frontier LLMs underneath. Most use GPT-4-class or Claude-Opus-class models. The lift over a single LLM call comes from the agent scaffolding, not from a bigger model.
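
The multi-step retrieval pattern described in the list above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not any vendor's implementation: `plan_queries`, `search`, `read`, and `is_sufficient` are hypothetical callables standing in for LLM calls and retrieval backends, and `max_rounds` is an arbitrary cap.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    source_url: str
    note: str


@dataclass
class ResearchState:
    question: str
    findings: list[Finding] = field(default_factory=list)
    queries_run: list[str] = field(default_factory=list)


def research_loop(question, plan_queries, search, read, is_sufficient, max_rounds=3):
    """Iterate plan -> search -> read until the evidence looks sufficient.

    All four callables are hypothetical stand-ins:
      plan_queries(state)  -> list[str]  (proposes the next searches)
      search(query)        -> list[str]  (returns URLs)
      read(url)            -> str        (returns a note extracted from the page or PDF)
      is_sufficient(state) -> bool       (decides whether to stop iterating)
    """
    state = ResearchState(question)
    for _ in range(max_rounds):
        for query in plan_queries(state):
            state.queries_run.append(query)
            for url in search(query):
                # Every note keeps its source URL so the final answer can cite it.
                state.findings.append(Finding(url, read(url)))
        if is_sufficient(state):
            break
    return state
```

The point of the sketch is the shape: the model is called several times around a growing pool of cited findings, which is where the lift over a single LLM call comes from.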

If you are choosing between deep-research tools because you think one of them is dramatically more accurate than another on a research question with a real literature answer, you are probably overweighting that axis. Across the questions we have tested, correctness clusters tightly when the answer is in the literature.

Where deep-research tools diverge

The interesting differences are about audience, deployment, and output shape — not which one is "smarter."

Audience

Tool family	Built for
Perplexity, You.com	General consumer search-replacement with a chat surface
ChatGPT Deep Research, Gemini Deep Research, Claude Research	Power users at the chat platform you already pay for
Elicit, Consensus	Academic researchers searching the literature
SPARKIT	Developers and research teams calling research from inside an agent or backend

The audience choice cascades into everything else. A product built for a chat user can be UI-only. A product built for a backend has to be API-first.

Deployment surface

This is the single largest functional split in the category today.

UI-only. ChatGPT Deep Research, Gemini Deep Research, and Claude Research live inside their respective chat apps. You cannot call them from your own code.
API + UI. Perplexity has Sonar; You.com has its Research surface; SPARKIT is API-first and the dashboard wraps the same API.
Hybrid. Elicit has an API but it is shaped around their literature-search UX, not arbitrary research questions.

If you are building an agent or putting research into a backend pipeline, the UI-only options are off the board even if they would win on answer quality. The remaining tools form your actual short list.

Output shape

What you get back varies more than people expect:

Chat-shaped. Perplexity, Claude Research, You.com — a few hundred words, conversational tone, citations inline.
Long-form memo. ChatGPT Deep Research, Gemini Deep Research — three to six thousand words, multi-section, often with optional experimental plans.
Structured Markdown + JSON. SPARKIT — a sectioned Markdown report (~500–2000 words) plus a sources array with title, URL, DOI, year, and citation count for each source.

The right shape depends on who or what consumes the output next. If a human is reading the answer, conversational is fine. If an agent is piping it into a downstream step, structured fields beat prose. Most consumer tools optimize for the human reader; SPARKIT optimizes for the downstream agent without giving up the human reader.
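
To make "structured fields beat prose" concrete, here is a minimal sketch of a downstream step reading a sources array. The payload is invented for illustration: the key names (`report_markdown`, `sources`, `doi`, `citation_count`) are assumptions that mirror the fields listed above, not a documented schema, and the entries are placeholders.

```python
import json

# Hypothetical payload. Key names are assumptions; entries are placeholders.
EXAMPLE_RESPONSE = json.dumps({
    "report_markdown": "## Findings\nPlaceholder report text.",
    "sources": [
        {"title": "Placeholder paper A", "url": "https://example.org/a",
         "doi": "10.1234/example.a", "year": 2017, "citation_count": 3},
        {"title": "Placeholder paper B", "url": "https://example.org/b",
         "doi": None, "year": 2023, "citation_count": 41},
    ],
})


def sources_for_next_step(response_text, min_year=None):
    """Return (title, doi, url) tuples without parsing any prose."""
    payload = json.loads(response_text)
    selected = []
    for src in payload["sources"]:
        if min_year is not None and src["year"] < min_year:
            continue
        selected.append((src["title"], src["doi"], src["url"]))
    return selected


recent = sources_for_next_step(EXAMPLE_RESPONSE, min_year=2020)
```

With chat-shaped output, the same step would need to scrape URLs out of free text and would have no year or DOI to filter on.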

Domain focus

Most tools are general-purpose web research. A handful specialize:

Elicit, Consensus — scientific literature first.
SPARKIT — scientific literature plus PDFs, web, sandboxed Python execution, citation traversal.
Everyone else — the general web.

For questions where the right answer lives in a 2017 paper nobody has cited in three years, general-web tools often miss it. The benchmark numbers separate cleanly here. On HLE-Gold (Humanity's Last Exam, gold subset — biology, medicine, chemistry), SPARKIT scores 54.4% versus 34.9% for both direct GPT-5.5 and Claude Opus 4.8. The lift is the agentic literature-search scaffold, not a different model underneath. On GAIA, which is general-web reasoning, SPARKIT scores 73.2% versus 58.2% for Exa and 57.0% for Brave.
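
For readers who want the gaps spelled out, the arithmetic on the figures quoted above is short. The helper name `gap_vs` is ours; the scores are the ones in the paragraph, and the HLE-Gold baseline is a single number because the post reports the same score for both direct models.

```python
# Scores quoted in the post, in percent.
hle_gold = {"SPARKIT": 54.4, "direct frontier model": 34.9}
gaia = {"SPARKIT": 73.2, "Exa": 58.2, "Brave": 57.0}


def gap_vs(scores, system, baseline):
    """Return (absolute gap in percentage points, relative lift in percent)."""
    absolute = scores[system] - scores[baseline]
    relative = 100.0 * absolute / scores[baseline]
    return round(absolute, 1), round(relative, 1)


print(gap_vs(hle_gold, "SPARKIT", "direct frontier model"))  # (19.5, 55.9)
print(gap_vs(gaia, "SPARKIT", "Exa"))                        # (15.0, 25.8)
print(gap_vs(gaia, "SPARKIT", "Brave"))                      # (16.2, 28.4)
```

Reporting both numbers matters: a 19.5-point gap on a 34.9% baseline is a much larger relative lift than a 15-point gap on a 58.2% baseline.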

Public benchmark coverage

A subtle but important differentiator: which tools publish benchmark numbers on a known evaluation, and which ask you to take their word for it.

Published. SPARKIT publishes head-to-head numbers on HLE-Gold and GAIA — see the About page for the headline figures and methodology, and the GAIA case-study post for a worked example of a specific failure mode in direct LLMs that the agent fixes.
No published deep-research benchmarks. At the time of writing, none of ChatGPT Deep Research, Gemini Deep Research, Claude Research, or Elicit publishes head-to-head benchmark numbers comparing their deep-research surface against frontier alternatives on a public eval like HLE or GAIA. Underlying frontier-model benchmarks (GPQA, MMLU, AIME) tell you about the base model — not the agent scaffold sitting on top.
Partial. Perplexity publishes Sonar benchmarks against general search, but not on scientific-research questions specifically.

A tested agent is more trustworthy than an untested one. Benchmarks are not perfect — they can be gamed, overfit, or fail to capture what your team actually cares about — but they are the closest thing the industry has to a verifiable trust signal. Before committing a workflow to a deep-research tool, ask: how does it score on a public eval, against what comparators, with what methodology?

Users should know what they are getting per query. A benchmark number is the most compact answer to that question the market has produced.

Citation auditability

Tools differ in how much of each citation they expose. Some cite by URL only. Some give you DOI and year. Some return a structured object that the next step in your pipeline can verify. If you are using a deep-research tool in a context where the citations matter (a clinical workflow, a grant draft, a regulatory submission, a tumor-board prep), citation auditability is closer to the top of the requirements list than "answer quality." If the citations are not machine-readable, you cannot programmatically verify them, and at scale you will not verify them manually.
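
As a rough illustration of what "programmatically verify" can mean, the sketch below runs offline sanity checks on one citation record. It assumes a record with `doi`, `year`, and `url` keys (hypothetical names), and the DOI pattern is a loose syntactic check rather than a full validator. A real audit would go further, for example by resolving each DOI through doi.org and comparing the returned metadata.

```python
import re
from urllib.parse import urlparse

# Loose syntactic check: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def audit_source(source, current_year=2026):
    """Return a list of problems found in one citation record (empty means it passes)."""
    problems = []

    doi = source.get("doi")
    if not doi:
        problems.append("missing DOI")
    elif not DOI_PATTERN.match(doi):
        problems.append("malformed DOI")

    year = source.get("year")
    if not isinstance(year, int) or not (1600 <= year <= current_year):
        problems.append("implausible year")

    parsed = urlparse(source.get("url") or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("bad URL")

    return problems
```

None of these checks is possible when the tool returns only prose with inline links, which is why the shape of the citation matters as much as its presence.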

The map

Combining the axes, the picture looks roughly like this:

Tool	Audience	Surface	Output	Wait
Perplexity	Consumer	API + UI	Chat	~10s
ChatGPT Deep Research	Power user	UI only	Long memo	5–10 min
Gemini Deep Research	Power user	UI only	Long memo	5–15 min
Claude Research	Power user	UI only	Chat	varies
Elicit	Academic	UI (limited API)	Chat + follow-up	~15s
SPARKIT	Developer / research team	API-first	Structured Markdown + JSON	~90s

(For latency and verbosity on a specific question, see the head-to-head.)
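
If you are short-listing for an agent or backend, the deployment-surface filter can be applied to the map mechanically. The sketch below transcribes a subset of the table's columns into data. Treating Elicit's "UI (limited API)" as not qualifying is our reading of the deployment-surface section, not a hard rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    audience: str
    surface: str


# Rows transcribed from the map (name, audience, and surface columns only).
TOOLS = [
    Tool("Perplexity", "Consumer", "API + UI"),
    Tool("ChatGPT Deep Research", "Power user", "UI only"),
    Tool("Gemini Deep Research", "Power user", "UI only"),
    Tool("Claude Research", "Power user", "UI only"),
    Tool("Elicit", "Academic", "UI (limited API)"),
    Tool("SPARKIT", "Developer / research team", "API-first"),
]

# Surfaces that expose a general-purpose API for arbitrary research questions.
GENERAL_API_SURFACES = {"API + UI", "API-first"}


def callable_from_code(tools):
    """Return the names of tools an agent or backend could call directly."""
    return [t.name for t in tools if t.surface in GENERAL_API_SURFACES]


short_list = callable_from_code(TOOLS)  # ["Perplexity", "SPARKIT"]
```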

Where SPARKIT sits

We built SPARKIT because the "API-first, scientific-literature-deep, structured-output, sub-two-minute" corner of the category was empty, and because we kept watching researcher friends paste the output from one chat tool into another to do work the second tool could not do natively. The pitch is narrow on purpose:

API-first. Your code calls the same engine the dashboard uses. No UI-only feature set.
Scientific literature, deeply. PubMed, bioRxiv, the Nature / Cell / Science archives, DOI traversal, PDF reading. We try to be excellent at the slice of research that involves the published scientific literature.
Structured output. Markdown for humans, JSON sources for the next pipeline step.
Sub-two-minute median. Fast enough to await inline in most agents; supports a callback_url for the rest.
Cited reports as standard. Every claim from the literature gets an inline citation. Show the work.
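
The "await inline, otherwise use a callback" choice mentioned in the list above can be sketched with a timeout. This is not the SPARKIT SDK: `run_inline` and `submit_with_callback` are hypothetical coroutines you would write around whatever client you use, and the 120-second budget is an arbitrary default.

```python
import asyncio


async def research_with_fallback(run_inline, submit_with_callback, question,
                                 inline_budget_s=120.0):
    """Await the report inline; if that exceeds the budget, hand off to a callback.

    Both callables are hypothetical stand-ins:
      run_inline(question)           -> awaitable returning the finished report
      submit_with_callback(question) -> awaitable returning a job id; the result
                                        is delivered later to your callback URL
    """
    try:
        report = await asyncio.wait_for(run_inline(question), timeout=inline_budget_s)
        return {"mode": "inline", "report": report}
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inline attempt at this point.
        job_id = await submit_with_callback(question)
        return {"mode": "callback", "job_id": job_id}
```

One caveat on the design: falling back this way submits the question a second time. If queries are metered, it may be better to decide up front which path a given agent step takes.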

For enterprise and research teams whose use case sits outside the off-the-shelf agent — connecting the agent to proprietary databases your group already pays for, internal corpora behind enterprise auth, paywalled scientific sources, a domain that needs targeted tuning, or a workflow where the agent is one component of a larger institutional pipeline (grant drafts, tumor boards, regulatory submissions, lab-notebook integration) — we offer custom engagements. The research agent is one SKU; the custom side is the other half of how SPARKIT delivers value. More on services →

Where SPARKIT does not win: if you are a single researcher reading the answer yourself and you do not need the agent integration, Perplexity is faster and Elicit has nicer follow-up UX. If you want a 5,000-word memo you can hand to a junior lab member with an experimental plan attached, ChatGPT Deep Research delivers that better than we do.

We have written about the integration gap before; the head-to-head post is where to read more on that.

Where the category is going

A few bets we are willing to make publicly:

UI-only deep research will eventually expose APIs. Most likely candidates: ChatGPT Deep Research and Gemini Deep Research, on a multi-month horizon. When they do, the deployment-surface axis collapses for those products and the competition shifts to output shape, latency, and domain specialization.
Domain-specialized agents will continue to outperform generalists on domain questions. The HLE-Gold gap between SPARKIT and direct frontier models is not closing in the near term.
Citation auditability will matter more than raw answer quality. As more downstream systems consume deep-research output, machine-readable citations will become non-negotiable.
Latency will get shorter. A 90-second cited report is a hard floor today; we expect it to compress as agent scaffolds get more efficient.

Run a question of your own

The fastest way to figure out which tool fits your workflow is to send the same question to two or three of them and look at the actual returns.
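
For the tools that can be called from code, a small harness keeps the comparison honest. In the sketch below, each entry in `tools` is a placeholder callable that you would wrap around a real client. UI-only tools cannot be scripted this way, so their answers would have to be pasted in by hand. The link count is a crude proxy for citations, not a quality measure.

```python
import time


def compare_tools(question, tools):
    """Send one question to each tool and collect comparable facts about the return.

    `tools` maps a label to a callable that takes the question and returns answer text.
    """
    rows = []
    for name, ask in tools.items():
        started = time.perf_counter()
        answer = ask(question)
        elapsed = time.perf_counter() - started
        rows.append({
            "tool": name,
            "seconds": round(elapsed, 2),
            "words": len(answer.split()),
            "links": answer.count("http://") + answer.count("https://"),
        })
    return rows
```

Latency, length, and link count will not tell you which answer is right, but they surface the output-shape differences quickly enough that you know which returns to read closely.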

For SPARKIT specifically: open the research agent, paste a question, and two minutes later you have a cited Markdown report. The Try-it bundle is $10 for 5 queries; the SDK is pip install sparkit-science. No subscription needed to evaluate.

If you find a question where one of the other tools beats SPARKIT in a way that matters to you, tell us — that is exactly the feedback that shapes what we build next.
