Deep research went from a niche capability to a feature on every major chat platform in under a year. Perplexity, ChatGPT, Claude, Gemini, Elicit, You.com, Consensus, SPARKIT: most platforms now ship some flavor of "ask a hard question, the system goes off and pulls together a multi-source answer." If you are choosing which tool to use, the question is no longer whether they get the answer right; that is now table stakes for any question with a literature-grounded answer. What matters is how the tool fits your workflow.
This post is the map. For a tactical head-to-head — same question sent to five tools — see Equip your agent with deep research in one line of code.
What every deep-research tool has in common
Pull back from the marketing and the surface differences look small.
- Multi-step retrieval. A question goes in, the system runs multiple searches, reads documents (often PDFs), reasons over what it found, and iterates. No credible deep-research product is a single LLM call.
- Citations. Every serious product surfaces sources alongside the answer. The form varies — inline URLs, footnotes, a sidebar — but the principle is identical.
- Time/quality tradeoff. Every tool spends more wall-clock time than a one-turn chat, anywhere from ~10 seconds (Perplexity) to 15 minutes (Gemini Deep Research). Depth scales with the wait.
- Safety screening. All major tools refuse dual-use research, weapons synthesis, and the obvious harm categories, with varying calibration.
- Frontier LLMs underneath. Most use GPT-4-class or Claude-Opus-class models. The lift over a single LLM call comes from the agent scaffolding, not from a bigger model.
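The multi-step retrieval loop in the first bullet can be sketched in a few lines. This is a generic illustration of the pattern, not any product's implementation: `deep_research`, `ResearchState`, and the three callables (`search`, `read`, `next_query`) are hypothetical names, and the toy stand-ins at the bottom replace a real search backend and a real model so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ResearchState:
    question: str
    sources: List[str] = field(default_factory=list)  # URLs read so far
    notes: List[str] = field(default_factory=list)    # one note per document


def deep_research(
    question: str,
    search: Callable[[str], List[str]],
    read: Callable[[str], str],
    next_query: Callable[[ResearchState], Optional[str]],
    max_steps: int = 4,
) -> ResearchState:
    """Search, read, take notes, then decide whether another pass is needed."""
    state = ResearchState(question)
    query: Optional[str] = question
    for _ in range(max_steps):
        for url in search(query):
            if url in state.sources:
                continue  # already read on an earlier pass
            state.sources.append(url)
            state.notes.append(read(url))
        query = next_query(state)  # the "reason over what it found" step
        if query is None:
            break  # nothing left to look up
    return state


# Toy stand-ins (assumptions) so the loop runs without a network or a model.
INDEX = {
    "what limits X?": ["https://example.org/review"],
    "follow-up on Y": ["https://example.org/review", "https://example.org/paper"],
}


def toy_search(query: str) -> List[str]:
    return INDEX.get(query, [])


def toy_read(url: str) -> str:
    return f"summary of {url}"


def toy_next_query(state: ResearchState) -> Optional[str]:
    # Ask one follow-up after the first document, then stop.
    return "follow-up on Y" if len(state.sources) == 1 else None


result = deep_research("what limits X?", toy_search, toy_read, toy_next_query)
print(result.sources)
```

The point of the sketch is the shape: the scaffold is an ordinary loop around search, read, and a decision step, which is where the lift over a single LLM call comes from.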
If you are choosing between deep-research tools because you think one is dramatically more accurate than another on a research question with a real literature answer, you are probably overweighting that axis. Across the questions we have tested, correctness clusters tightly whenever the answer is in the literature.
Where deep-research tools diverge
The interesting differences are about audience, deployment, and output shape — not which one is "smarter."
Audience
| Tool family | Built for |
|---|---|
| Perplexity, You.com | General consumer search-replacement with a chat surface |
| ChatGPT Deep Research, Gemini Deep Research, Claude Research | Power users at the chat platform you already pay for |
| Elicit, Consensus | Academic researchers searching the literature |
| SPARKIT | Developers and research teams calling research from inside an agent or backend |
The audience choice cascades into everything else. A product built for a chat user can be UI-only. A product built for a backend has to be API-first.
Deployment surface
This is the single largest functional split in the category today.
- UI-only. ChatGPT Deep Research, Gemini Deep Research, and Claude Research live inside their respective chat apps. You cannot call them from your own code.
- API + UI. Perplexity has Sonar; You.com has its Research surface; SPARKIT is API-first and the dashboard wraps the same API.
- Hybrid. Elicit has an API but it is shaped around their literature-search UX, not arbitrary research questions.
If you are building an agent or putting research into a backend pipeline, the UI-only options are off the board even if they would win on answer quality. The remaining tools form your actual short list.
Output shape
What you get back varies more than people expect:
- Chat-shaped. Perplexity, Claude Research, You.com — a few hundred words, conversational tone, citations inline.
- Long-form memo. ChatGPT Deep Research, Gemini Deep Research — three to six thousand words, multi-section, often with optional experimental plans.
- Structured Markdown + JSON. SPARKIT returns a sectioned Markdown report (~500–2000 words) plus a `sources` array with title, URL, DOI, year, and citation count for each source.
The right shape depends on who or what consumes the output next. If a human is reading the answer, conversational is fine. If an agent is piping it into a downstream step, structured fields beat prose. Most consumer tools optimize for the human reader; SPARKIT optimizes for the agent pipeline without giving up the human reader.
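To make the difference concrete, here is a minimal sketch of splitting one structured response between a human reader and a downstream agent step. The payload is invented for illustration: the key names (`report`, `sources`, `doi`, `citation_count`) are assumptions that mirror the fields listed above, not a documented schema, and the papers are placeholders.

```python
import json

# Hypothetical payload. Key names are assumptions, not a documented schema.
RAW = """
{
  "report": "## Summary\\nPlaceholder report text [1].",
  "sources": [
    {"title": "Placeholder paper A", "url": "https://example.org/a",
     "doi": "10.1234/example.a", "year": 2017, "citation_count": 3},
    {"title": "Placeholder paper B", "url": "https://example.org/b",
     "doi": null, "year": 2022, "citation_count": 41}
  ]
}
"""


def split_for_consumers(raw: str):
    """Return the Markdown for a human and the DOI-bearing sources for an agent."""
    payload = json.loads(raw)
    markdown = payload["report"]  # a human reads this as-is
    # An agent step gets plain fields it can filter without parsing prose.
    with_doi = [s for s in payload["sources"] if s.get("doi")]
    return markdown, with_doi


markdown, with_doi = split_for_consumers(RAW)
print(markdown)
print([s["doi"] for s in with_doi])
```

With prose-only output, the same filter would need a citation parser in front of it, which is the practical cost of chat-shaped answers in a pipeline.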
Domain focus
Most tools are general-purpose web research. A handful specialize:
- Elicit, Consensus — scientific literature first.
- SPARKIT — scientific literature plus PDFs, web, sandboxed Python execution, citation traversal.
- Everyone else — the general web.
For questions where the right answer lives in a 2017 paper nobody has cited in three years, general-web tools often miss it. The benchmark numbers separate cleanly here. On HLE-Gold (Humanity's Last Exam, gold subset — biology, medicine, chemistry), SPARKIT scores 53.0% versus 34.9% for direct GPT-5.5 and 28.9% for direct Claude Opus 4.7. The lift is the agentic literature-search scaffold, not a different model underneath. On GAIA, which is general-web reasoning, SPARKIT scores 75.6% versus 58.2% for Exa and 57.0% for Brave.
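For readers who want the gaps rather than the raw scores, the arithmetic is short. The sketch below uses only the figures quoted above; the `lift` helper is a hypothetical name, and it reports both the absolute gap in percentage points and the gap relative to the baseline.

```python
# Scores as quoted in the paragraph above, in percent.
SCORES = {
    "HLE-Gold": {
        "SPARKIT": 53.0,
        "GPT-5.5 (direct)": 34.9,
        "Claude Opus 4.7 (direct)": 28.9,
    },
    "GAIA": {"SPARKIT": 75.6, "Exa": 58.2, "Brave": 57.0},
}


def lift(benchmark: str, baseline: str, system: str = "SPARKIT"):
    """Return (gap in percentage points, gap as a percent of the baseline)."""
    s = SCORES[benchmark][system]
    b = SCORES[benchmark][baseline]
    return round(s - b, 1), round(100 * (s - b) / b, 1)


for benchmark, table in SCORES.items():
    for name in table:
        if name != "SPARKIT":
            points, relative = lift(benchmark, name)
            print(f"{benchmark} vs {name}: +{points} points ({relative}% relative)")
```

On HLE-Gold that is a gap of 18.1 points over direct GPT-5.5 and 24.1 points over direct Claude Opus 4.7.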
Public benchmark coverage
A subtle but important differentiator: which tools publish benchmark numbers on a known evaluation, and which ask you to take their word for it.
- Published. SPARKIT publishes head-to-head numbers on HLE-Gold and GAIA — see the About page for the headline figures and methodology, and the GAIA case-study post for a worked example of a specific failure mode in direct LLMs that the agent fixes.
- No published deep-research benchmarks. At the time of writing, none of ChatGPT Deep Research, Gemini Deep Research, Claude Research, or Elicit publishes head-to-head benchmark numbers comparing their deep-research surface against frontier alternatives on a public eval like HLE or GAIA. Underlying frontier-model benchmarks (GPQA, MMLU, AIME) tell you about the base model — not the agent scaffold sitting on top.
- Partial. Perplexity publishes Sonar benchmarks against general search, but not on scientific-research questions specifically.
A tested agent is more trustworthy than an untested one. Benchmarks are not perfect — they can be gamed, overfit, or fail to capture what your team actually cares about — but they are the closest thing the industry has to a verifiable trust signal. Before committing a workflow to a deep-research tool, ask: how does it score on a public eval, against what comparators, with what methodology?
Users should know what they are getting per query. A benchmark number is the most compact answer to that question the market has produced.
Citation auditability
Citation format is an easy axis to overlook. Some tools cite by URL only. Some give you DOI and year. Some return a structured object that the next step in your pipeline can verify. If you are using a deep-research tool in a context where the citations matter (a clinical workflow, a grant draft, a regulatory submission, a tumor-board prep), citation auditability sits closer to the top of the requirements list than "answer quality." If the citations are not machine-readable, you cannot programmatically verify them, and at scale you will not verify them manually.
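As a rough illustration of what machine-readable citations make possible, the sketch below runs an offline shape check on one source record. The record keys (`doi`, `year`, `url`) and the `audit_source` helper are assumptions for illustration. The check only confirms that the fields are well-formed; confirming that a DOI actually resolves to the cited paper would need a lookup against a registry, which is left out here.

```python
import re
from datetime import date
from typing import List, Optional
from urllib.parse import urlparse

# Shape of a modern DOI: "10.", a 4-9 digit registrant prefix, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def audit_source(source: dict, current_year: Optional[int] = None) -> List[str]:
    """Return a list of problems found in one citation record (empty if none)."""
    if current_year is None:
        current_year = date.today().year
    problems = []

    doi = source.get("doi")
    if not doi:
        problems.append("missing DOI")
    elif not DOI_PATTERN.match(doi):
        problems.append("malformed DOI")

    year = source.get("year")
    if not isinstance(year, int) or not 1600 <= year <= current_year:
        problems.append("implausible year")

    parsed = urlparse(source.get("url") or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("unusable URL")

    return problems


good = {"doi": "10.1234/example.a", "year": 2017, "url": "https://example.org/a"}
bad = {"doi": "example", "year": 3017, "url": "ftp://x"}
print(audit_source(good))
print(audit_source(bad))
```

A URL-only citation fails the first check by construction, which is why it cannot be audited this way at scale.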
The map
Combining the axes, the picture roughly looks like:
| Tool | Audience | Surface | Output | Wait |
|---|---|---|---|---|
| Perplexity | Consumer | API + UI | Chat | ~10s |
| ChatGPT Deep Research | Power user | UI only | Long memo | 5–10 min |
| Gemini Deep Research | Power user | UI only | Long essay | 5–15 min |
| Claude Research | Power user | UI only | Chat | varies |
| Elicit | Academic | UI (limited API) | Chat + follow-up | ~15s |
| SPARKIT | Developer / research team | API-first | Structured Markdown + JSON | ~90s |
(For latency and verbosity on a specific question, see the head-to-head.)
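The table can also be treated as data and filtered by the constraints discussed earlier. The sketch below copies the rows above into a list and builds a short list for a backend use case. The `Tool` type and `shortlist` helper are hypothetical, and two simplifications are assumptions: each wait is reduced to the upper end of the quoted range in seconds, and "varies" is stored as `None` and excluded whenever a latency budget applies.

```python
from typing import List, NamedTuple, Optional


class Tool(NamedTuple):
    name: str
    audience: str
    surface: str
    output: str
    max_wait_s: Optional[int]  # upper end of the quoted wait; None for "varies"


TOOLS = [
    Tool("Perplexity", "Consumer", "API + UI", "Chat", 10),
    Tool("ChatGPT Deep Research", "Power user", "UI only", "Long memo", 600),
    Tool("Gemini Deep Research", "Power user", "UI only", "Long essay", 900),
    Tool("Claude Research", "Power user", "UI only", "Chat", None),
    Tool("Elicit", "Academic", "UI (limited API)", "Chat + follow-up", 15),
    Tool("SPARKIT", "Developer / research team", "API-first",
         "Structured Markdown + JSON", 90),
]


def shortlist(tools: List[Tool], max_wait_s: int) -> List[str]:
    """Keep tools that are callable from code and fit the latency budget."""
    return [
        t.name
        for t in tools
        if t.surface != "UI only"
        and t.max_wait_s is not None
        and t.max_wait_s <= max_wait_s
    ]


print(shortlist(TOOLS, max_wait_s=120))
```

Elicit passes the surface filter here because the table lists a limited API; whether that API fits an arbitrary research question is a separate check.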
Where SPARKIT sits
We built SPARKIT because the "API-first, scientific-literature-deep, structured-output, sub-two-minute" corner of the category was empty, and because we kept watching researcher friends paste the output from one chat tool into another to do work the second tool could not do natively. The pitch is narrow on purpose:
- API-first. Your code calls the same engine the dashboard uses. No UI-only feature set.
- Scientific literature, deeply. PubMed, bioRxiv, the Nature / Cell / Science archives, DOI traversal, PDF reading. We try to be excellent at the slice of research that involves the published scientific literature.
- Structured output. Markdown for humans, JSON sources for the next pipeline step.
- Sub-two-minute median. Fast enough to await inline in most agents; supports a `callback_url` for the rest.
- Cited reports as standard. Every claim from the literature gets an inline citation. Show the work.
For enterprise and research teams whose use case sits outside the off-the-shelf agent — connecting the agent to proprietary databases your group already pays for, internal corpora behind enterprise auth, paywalled scientific sources, a domain that needs targeted tuning, or a workflow where the agent is one component of a larger institutional pipeline (grant drafts, tumor boards, regulatory submissions, lab-notebook integration) — we offer custom engagements. The research agent is one SKU; the custom side is the other half of how SPARKIT delivers value. More on services →
Where SPARKIT does not win: if you are a single researcher reading the answer yourself and you do not need the agent integration, Perplexity is faster and Elicit has nicer follow-up UX. If you want a 5,000-word memo you can hand to a junior lab member with an experimental plan attached, ChatGPT Deep Research delivers that better than we do.
We have written about the integration gap before; the head-to-head post is where to read more on that.
Where the category is going
A few bets we are willing to make publicly:
- UI-only deep research will eventually expose APIs. Most likely candidates: ChatGPT Deep Research and Gemini Deep Research, on a multi-month horizon. When they do, the deployment-surface axis collapses for those products and the competition shifts to output shape, latency, and domain specialization.
- Domain-specialized agents will continue to outperform generalists on domain questions. The HLE-Gold gap between SPARKIT and direct frontier models is not closing in the near term.
- Citation auditability will matter more than raw answer quality. As more downstream systems consume deep-research output, machine-readable citations will become non-negotiable.
- Latency will get shorter. A 90-second cited report is a hard floor today; we expect it to compress as agent scaffolds get more efficient.
Run a question of your own
The fastest way to figure out which tool fits your workflow is to send the same question to two or three of them and look at the actual returns.
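A side-by-side run is easy to script once each tool is wrapped in a function that takes a question and returns text. The sketch below is a minimal harness under that assumption: `compare` is a hypothetical helper, and `tool_a` and `tool_b` are placeholders returning canned strings, to be swapped for real client calls or pasted UI output.

```python
import re
import time
from typing import Callable, Dict, List

URL_PATTERN = re.compile(r"https?://\S+")


def compare(question: str, tools: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Send one question to each tool and record latency, length, and citations."""
    rows = []
    for name, ask in tools.items():
        start = time.perf_counter()
        answer = ask(question)
        elapsed = time.perf_counter() - start
        rows.append({
            "tool": name,
            "seconds": round(elapsed, 2),
            "words": len(answer.split()),
            "cited_urls": len(set(URL_PATTERN.findall(answer))),
        })
    return rows


# Placeholder tools (assumptions): replace with real calls when evaluating.
def tool_a(question: str) -> str:
    return "Short answer. See https://example.org/a"


def tool_b(question: str) -> str:
    return "Longer answer with two sources: https://example.org/a and https://example.org/b"


rows = compare("your question here", {"tool_a": tool_a, "tool_b": tool_b})
for row in rows:
    print(row)
```

Counting URLs is only a proxy for citation quality, so it is worth reading the returns as well as tabulating them.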
For SPARKIT specifically: open the research agent, paste a question, and two minutes later you have a cited Markdown report. The Try-it bundle is $10 for 5 queries; the SDK installs with `pip install sparkit-science`. No subscription is needed to evaluate.
If you find a question where one of the other tools beats SPARKIT in a way that matters to you, tell us — that is exactly the feedback that shapes what we build next.