
AI safety in a research agent: what's in place, what we don't claim

When research agents run at scale, two failure modes dominate everything else: invisible hallucination at industrial volume, and agentic uplift for harmful research. The first is a quality problem — confidently produced wrong answers a reader can't easily tell from right ones. The second is a public-safety problem — an agent that allows iteration on harmful synthesis routes without friction is a different kind of risk than a textbook on a library shelf.

Most research-agent products either don't talk about either of these, or talk about them in ways that read as marketing copy. SPARKIT was designed with both in mind, and we want to be specific about what's in place — and what isn't.

What's in place

Input screening. Every query passes through pre-agent gates. Pattern-based filters catch obvious chitchat, prompt-injection attempts, and queries matching a list of high-risk biosecurity / dual-use patterns; an LLM-based router then classifies the request and refuses anything the patterns missed. We don't publish exact patterns or thresholds — publishing them would let attackers iterate against them — but the categories are concrete: dual-use research of concern, biological and chemical weapons design, controlled-substance synthesis, requests that would compromise individual privacy or de-anonymize research participants, and prompt-injection attempts targeting our system prompt.
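To make the shape of this concrete, here is a minimal sketch of a two-stage input gate. The patterns, category names, and the `classify_with_llm` helper are illustrative assumptions for the example, not our actual filters, categories, or thresholds.

```python
import re
from dataclasses import dataclass

# Illustrative only: the real pattern list and categories are not published.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),  # prompt injection
    re.compile(r"\bsynthesi[sz]e\b.*\bnerve agent\b", re.IGNORECASE),   # dual-use example
]

REFUSAL_CATEGORIES = {"dual_use", "weapons", "controlled_substances",
                      "privacy", "prompt_injection"}

@dataclass
class GateDecision:
    allowed: bool
    reason: str = ""

def screen_input(query: str, classify_with_llm) -> GateDecision:
    """Two-stage gate: a cheap pattern pass first, an LLM router second."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(query):
            return GateDecision(False, "pattern_match")
    # classify_with_llm is a hypothetical helper that returns a category string.
    category = classify_with_llm(query)
    if category in REFUSAL_CATEGORIES:
        return GateDecision(False, category)
    return GateDecision(True)
```

The ordering is the point: the cheap deterministic pass runs before any model sees the query, and the LLM router only has to catch what slipped past it.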

Output screening. The agent's draft passes a second filter before it reaches the customer. We block leaked system prompts, dangerous synthesis instructions the input layer might have missed, and references to internal architecture (specific search providers, model identifiers) that we don't want exposed. False positives are preferable to false negatives at this layer.
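A sketch of the same idea on the output side, under the same caveat: the checks and names here are hypothetical, and the real filter runs more of them.

```python
def screen_output(draft: str, system_prompt: str, internal_terms: list[str]) -> bool:
    """Return True only if the draft is safe to send to the customer.

    Hypothetical checks: block verbatim system-prompt leakage and any
    mention of internal providers or model identifiers. Dangerous-content
    checks would reuse the same category logic as the input gate.
    """
    if system_prompt and system_prompt[:80] in draft:
        return False  # system prompt leaked verbatim
    if any(term.lower() in draft.lower() for term in internal_terms):
        return False  # internal architecture mentioned
    return True
```

Returning False here means the customer gets a refusal or a regenerated draft rather than the blocked one, which is what biasing toward false positives looks like in practice.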

Mandatory citations. Every claim in a SPARKIT report that came from the literature carries an inline citation linking to the source. This is the auditability layer: if the agent says something, the reader can click through and see where it came from. Citations are the reader's check on the agent — they're meant to be clicked, not glanced at.

Structured citation markers. Internally, the agent emits citation markers that pin to specific evidence rows rather than typing author names freehand into prose. This is a defense against a specific failure mode we observed early in development: the model substituting common surnames (Wang, Zhang) for less common ones when typing them itself. The renderer fills in the author and year from structured metadata, not from the model's prose.
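A minimal sketch of what that renderer step looks like. The marker syntax, the evidence-table shape, and the example entry are assumptions for illustration, not our internal format.

```python
import re

# Hypothetical evidence table keyed by row id; in practice this comes from
# the agent's search results, never from the model's prose.
EVIDENCE = {
    12: {"author": "Okonkwo", "year": 2021, "url": "https://example.org/paper"},
}

MARKER = re.compile(r"\[cite:(\d+)\]")

def render_citations(text: str, evidence: dict[int, dict]) -> str:
    """Replace structured markers with author/year pulled from metadata,
    so the model never types surnames freehand."""
    def replace(match: re.Match) -> str:
        row = evidence[int(match.group(1))]
        return f"({row['author']}, {row['year']})"
    return MARKER.sub(replace, text)

# "…as shown in [cite:12]." -> "…as shown in (Okonkwo, 2021)."
```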

Visible research process. Every report exposes the agent's effort: in the dashboard playground as a stats strip showing searches run, papers read, calculations performed, and time spent — and as structured fields in the API response that any SDK consumer can choose to display. This is the audit trail for effort. Readers can see whether the agent actually did the work the answer claims to be based on.
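For a sense of what "structured fields" means, here is the rough shape an SDK consumer might work with. The field names below are assumptions for the example, not the published API schema.

```python
# Illustrative shape only; the point is that effort is machine-readable,
# not just rendered in the dashboard.
report_stats = {
    "searches_run": 14,
    "papers_read": 23,
    "calculations_performed": 3,
    "elapsed_seconds": 412,
}

def effort_summary(stats: dict) -> str:
    """One-line summary a consumer might render next to the report."""
    return (f"{stats['searches_run']} searches, {stats['papers_read']} papers read, "
            f"{stats['calculations_performed']} calculations, {stats['elapsed_seconds']}s")
```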

Privacy posture. We don't train on customer queries. We don't sell them. The privacy policy is short, hand-written, and means what it says. Postgres encrypted at rest, error traces redacted of customer content, the scope of what we collect deliberately tight.

Rate limiting. Abuse-shaped traffic gets throttled at the session level before the agent loop runs. Less of an AI-safety mechanism than a service-availability one, but worth being explicit about.
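The mechanism is the standard one; a per-session token bucket along these lines, with limits that are illustrative rather than ours.

```python
import time
from collections import defaultdict

class SessionRateLimiter:
    """Token bucket per session id; capacity and refill rate are made up here."""

    def __init__(self, capacity: int = 10, refill_per_second: float = 0.1):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: dict[str, tuple[float, float]] = defaultdict(
            lambda: (float(capacity), time.monotonic())
        )

    def allow(self, session_id: str) -> bool:
        tokens, last = self._buckets[session_id]
        now = time.monotonic()
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens < 1.0:
            self._buckets[session_id] = (tokens, now)
            return False  # throttle before the agent loop runs
        self._buckets[session_id] = (tokens - 1.0, now)
        return True
```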

What we deliberately don't claim

SPARKIT is an LLM-driven system and can still be wrong. Citations can be misattributed. Conclusions can overstate what the literature actually supports. Refusal isn't perfect either — sometimes the gates miss something they should catch, and sometimes they refuse something they shouldn't. We try to bias toward over-refusal when in doubt, but we don't claim the calibration is correct in every case. And once a report is in your hands, what you do with it is your decision — we can't audit downstream use.

A note on broad access

We believe rigorous research should be available to anyone doing real science, not gated behind institutional budgets. Refusing some queries is what makes that broad access defensible: a tool that draws lines on harm is one we can offer to anyone working in good faith. The line we draw isn't a content-moderation question; it's the same line a chemistry professor draws when a stranger emails asking how to synthesize a nerve agent.

What's coming

If you find a way to bypass anything described above, write us at info@sparkit.science. We respond. SOC2 prep is on the year-one roadmap; before that, infosec questionnaires get answered directly, not by boilerplate. If your organization has specific compliance or data-handling requirements that need to be in writing before you can use SPARKIT, we'll scope that as a custom engagement.

The honest version: AI safety in a research agent is a layered problem with no single fix. We've built layered defenses against the failure modes we think matter most, and we're upfront about the residual risk. If we missed something, we want to know.