Our research team works on the layer above the model: the loops, evaluations, memory designs, and verification patterns that turn frontier reasoning into a teammate you can actually trust with real work.
Renderell agents propose an action, then verify it against a separate model and a set of constraints before it executes. Internal evals show this cuts hallucinated tool calls by >75% in the workflows that matter.
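In rough pseudocode, that loop looks like the sketch below. The function names, the constraint list, and the interfaces are illustrative placeholders, not the production code; the point is that deterministic checks and a second model both have to sign off before anything runs.

```python
# A minimal sketch of propose-then-verify. propose_fn, verify_fn, and the
# constraints are placeholders, not Renderell's actual interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    tool: str
    args: dict

def propose_and_verify(
    task: str,
    propose_fn: Callable[[str], ToolCall],          # one model proposes an action
    verify_fn: Callable[[str, ToolCall], bool],     # a separate model judges the proposal
    constraints: list[Callable[[ToolCall], bool]],  # hard checks, e.g. schemas or allow-lists
) -> ToolCall | None:
    call = propose_fn(task)
    # Hard constraints run first: cheap, deterministic, non-negotiable.
    if not all(check(call) for check in constraints):
        return None
    # The verifier is a different model than the proposer, so correlated
    # hallucinations are less likely to slip through.
    if not verify_fn(task, call):
        return None
    return call
```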
An agent that always tries is dangerous. An agent that knows when to stop and ask is useful. We've built calibrated uncertainty estimation into the decision loop.
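One way to picture the stop-and-ask decision is as a calibrated threshold check. The confidence source and the per-workflow thresholds below are illustrative, not the actual estimator.

```python
# Sketch: act only when calibrated confidence clears a per-workflow threshold,
# otherwise escalate to a human. Values here are stand-ins.
from enum import Enum

class Decision(Enum):
    ACT = "act"
    ASK_HUMAN = "ask_human"

def decide(confidence: float, workflow: str, thresholds: dict[str, float]) -> Decision:
    # Thresholds are set per workflow from held-out eval data, so "confident"
    # means the same thing in production as it did during calibration.
    threshold = thresholds.get(workflow, 0.95)
    return Decision.ACT if confidence >= threshold else Decision.ASK_HUMAN
```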
Cheaper models handle classification and lookups; frontier models handle planning and decisions. Our router makes that choice per step; typical jobs run at 30% of single-model cost with higher accuracy.
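A stripped-down version of that routing decision, with placeholder step kinds and model names standing in for the real policy:

```python
# Illustrative per-step router: cheap model for classification and lookups,
# frontier model for planning and decisions. Names are placeholders.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

ROUTING_TABLE = {
    "classify": CHEAP_MODEL,
    "lookup": CHEAP_MODEL,
    "plan": FRONTIER_MODEL,
    "decide": FRONTIER_MODEL,
}

def route(step_kind: str) -> str:
    # Unknown step kinds fall back to the frontier model: the costly failure
    # is a wrong decision, not an expensive token bill.
    return ROUTING_TABLE.get(step_kind, FRONTIER_MODEL)
```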
Per-user, per-account, per-workflow memory. Preferences, glossaries, prior decisions, and edge cases persist — so the agent compounds in usefulness instead of resetting every session.
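A minimal sketch of what scoped memory means in practice; the in-memory dictionary stands in for a real store, and the field names are illustrative:

```python
# Sketch: entries are keyed by (user, account, workflow), so one customer's
# preferences and edge cases never leak into another's session.
from collections import defaultdict

class ScopedMemory:
    def __init__(self):
        self._store: dict[tuple[str, str, str], dict[str, str]] = defaultdict(dict)

    def remember(self, user: str, account: str, workflow: str, key: str, value: str) -> None:
        self._store[(user, account, workflow)][key] = value

    def recall(self, user: str, account: str, workflow: str) -> dict[str, str]:
        # Reads are scoped exactly like writes: no cross-user bleed.
        return dict(self._store[(user, account, workflow)])

memory = ScopedMemory()
memory.remember("ada", "acme", "invoicing", "currency", "EUR")
assert memory.recall("ada", "acme", "invoicing")["currency"] == "EUR"
```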
Typed tools with schema-checked arguments fail loudly instead of silently. Idempotency keys and dry-runs make actions safe to retry without double-billing your customer or double-filing your ticket.
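Here is a toy example of a schema-checked, idempotent tool call, with a hypothetical refund tool standing in for a real integration. Bad arguments fail loudly; a retry with the same idempotency key is a no-op.

```python
# Sketch only: the refund tool, its schema, and the executed-keys set are
# illustrative, not a real billing integration.
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundArgs:
    invoice_id: str
    amount_cents: int

    def __post_init__(self):
        # Schema checks fail loudly before anything touches the billing system.
        if not self.invoice_id:
            raise ValueError("invoice_id must be non-empty")
        if self.amount_cents <= 0:
            raise ValueError("amount_cents must be positive")

_executed: set[str] = set()

def refund(args: RefundArgs, idempotency_key: str, dry_run: bool = False) -> str:
    if dry_run:
        return f"would refund {args.amount_cents} cents on {args.invoice_id}"
    if idempotency_key in _executed:
        return "already executed; retry is a no-op"  # no double-billing
    _executed.add(idempotency_key)
    return f"refunded {args.amount_cents} cents on {args.invoice_id}"
```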
We treat task-specific evals as a first-class artifact. Every customer workflow gets its own eval set; every model upgrade is gated by it. No silent regressions.
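In outline, the gate is simple; the scoring function and eval cases below are placeholders for a customer's actual eval set:

```python
# Sketch: a candidate model ships for a workflow only if it matches or beats
# the current model on that workflow's own eval set.
from typing import Callable

def gate_upgrade(
    eval_cases: list[dict],
    score_fn: Callable[[str, dict], float],  # (model, case) -> score in [0, 1]
    current_model: str,
    candidate_model: str,
) -> bool:
    # Assumes a non-empty eval set; every customer workflow has one.
    current = sum(score_fn(current_model, c) for c in eval_cases) / len(eval_cases)
    candidate = sum(score_fn(candidate_model, c) for c in eval_cases) / len(eval_cases)
    # No silent regressions: the candidate must match or beat the incumbent.
    return candidate >= current
```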
Models keep getting smarter. The reliability gap between "impressive demo" and "production teammate" hasn't closed at the same pace — because that gap is filled by infrastructure, not intelligence.
You need a reasoning loop that doesn't double-act. Memory that doesn't bleed across users. Tools that fail loudly. Approvals that route to the right human. Rollback when an action lands wrong. We've spent years on each of these, separately and together — and Renderell is the system that wires them up.
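As a small illustration of the approvals piece, consider a router that auto-approves low-risk actions and sends everything else to a named owner; the risk scores and approver map are stand-ins, not a real policy.

```python
# Sketch: low-risk actions proceed, higher-risk ones wait for the human who
# owns that action type. Thresholds and owners here are illustrative.
APPROVERS = {
    "refund": "finance-oncall",
    "ticket_update": "support-lead",
}

def route_approval(action_type: str, risk: float, auto_threshold: float = 0.2) -> str:
    if risk <= auto_threshold:
        return "auto-approve"
    # Anything above the threshold routes to the right human, not a generic queue.
    return APPROVERS.get(action_type, "default-reviewer")
```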
The frontier model is rented from the labs. The reliability layer is built by us — and it's what customers actually pay for.
One mega-prompt is brittle. A graph of small renderers, each tuned to one task, beats it across every quality and cost axis.
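The simplest instance of such a graph is a short pipeline where each node does one narrow job; the nodes below are toy examples, not real renderers.

```python
# Sketch: each node handles one task and passes a small, checkable result to
# the next, so failures stay local instead of hiding inside one giant prompt.
from typing import Callable

def extract(text: str) -> str:
    return text.strip()

def classify(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"

def draft_reply(label: str) -> str:
    return f"Routing as: {label}"

PIPELINE: list[Callable[[str], str]] = [extract, classify, draft_reply]

def run(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text

print(run("  Invoice #42 overdue  "))  # -> "Routing as: invoice"
```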
If an agent did something, you should be able to see what, why, and how to undo it — in one click, not one investigation.
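A minimal sketch of what that record can look like: each action carries its reason and an undo callable, so rollback is a single call. Field names are illustrative.

```python
# Sketch: every action is logged with what happened, why, and how to undo it.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ActionRecord:
    what: str
    why: str
    undo: Callable[[], None]

@dataclass
class AuditLog:
    records: list[ActionRecord] = field(default_factory=list)

    def log(self, what: str, why: str, undo: Callable[[], None]) -> None:
        self.records.append(ActionRecord(what, why, undo))

    def rollback_last(self) -> None:
        # One click, not one investigation: pop the latest action and run its undo.
        self.records.pop().undo()
```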
How we cut hallucinated tool calls by 78% on our internal eval set — and why that number is misleading without the right benchmark.
Picking the right model per step is harder than picking the cheapest. A discussion of what we tried and what stuck.
Notes on per-user vs per-workflow memory, fact extraction vs retrieval, and the tradeoffs we landed on.
Why pass-rate isn't enough, and what we measure instead: action-fidelity, escalation-precision, rollback-rate.
How strongly-typed tools change the safety profile of an agent — and why we generate ours from OpenAPI specs.
Agents that run for hours or days need different architecture than agents that run for seconds. Notes on what changes.
Read the research notes, or talk to the team about applying these ideas to your domain.