99.1% accuracy across 200,000+ tasks. The number gets cited a lot. Here is exactly how we produce it.
The core formula
Every task is routed to N workers (default 3). Each worker submits a result and a confidence score. We run a weighted consensus calculation:
winnerVotes = votes_for_winner / total_votes
weightedConf = mean(confidence scores for winner votes)
finalConfidence = winnerVotes × weightedConf
if finalConfidence >= min_confidence → completed
else if attempts < 3 → re_queued
else → failed
The default min_confidence is 0.85. Developers can raise or lower it per task. Raising it (e.g. to 0.95) improves accuracy at the cost of higher re-queue rates and latency. Lowering it (e.g. to 0.70) increases throughput for tasks where near-perfect precision isn't required.
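The formula and routing logic above can be sketched in a few lines of Python. This is an illustrative sketch, not the production code: `Vote` and `decide` are hypothetical names, and the thresholds match the defaults described above.

```python
# Sketch of the weighted-consensus calculation. `Vote` and `decide`
# are illustrative names, not the production API.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Vote:
    result: str        # the worker's submitted answer
    confidence: float  # the worker's self-reported confidence, 0.0 to 1.0

def decide(votes, min_confidence=0.85, attempts=1):
    # winnerVotes: share of workers that agree on the most common result
    tally = Counter(v.result for v in votes)
    winner, winner_count = tally.most_common(1)[0]
    winner_votes = winner_count / len(votes)

    # weightedConf: mean confidence among workers that voted for the winner
    winner_confs = [v.confidence for v in votes if v.result == winner]
    weighted_conf = sum(winner_confs) / len(winner_confs)

    final_confidence = winner_votes * weighted_conf
    if final_confidence >= min_confidence:
        return "completed", final_confidence
    elif attempts < 3:
        return "re_queued", final_confidence
    else:
        return "failed", final_confidence
```

For a unanimous 3-0 vote with confidences 0.90, 0.94, and 0.92, this yields 1.0 × 0.92 = 0.92, which clears the default 0.85 threshold and completes.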
Why we use confidence weighting, not simple majority
Simple majority voting has a well-known failure mode: three workers who are all uncertain can outvote one worker who is highly confident. Our formula accounts for this. A 3-0 vote where all three workers submit 0.55 confidence produces a finalConfidence of 0.55 — which fails a 0.85 threshold and triggers re-queue. A 2-1 vote where the two agreeing workers submit 0.95 confidence produces a finalConfidence of about 0.63 — which may still fail, depending on your threshold.
This means high-confidence minority results surface for review rather than getting silently outvoted. Edge cases get caught, not buried.
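The two worked examples above reduce to plain arithmetic with the finalConfidence = winnerVotes × weightedConf formula:

```python
# Case 1: 3-0 vote, all three workers at 0.55 confidence
case1 = (3 / 3) * ((0.55 + 0.55 + 0.55) / 3)  # 1.0 * 0.55 = 0.55

# Case 2: 2-1 vote, the two agreeing workers at 0.95 confidence
case2 = (2 / 3) * ((0.95 + 0.95) / 2)          # ~0.667 * 0.95 ~ 0.633

# Both fall below the default 0.85 threshold and would re-queue.
```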
Re-queuing and the 3-attempt limit
When confidence falls below threshold, the task enters re_queued status with a fresh set of workers — different workers than the previous attempt to avoid anchoring effects. If confidence fails on all three attempts, the task transitions to failed status and the developer is not charged.
In practice, less than 0.4% of tasks reach 3 failed attempts. Most low-confidence tasks resolve on the second attempt when ambiguous cases are clarified by workers with more domain familiarity.
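The attempt lifecycle can be sketched as a simple loop. `run_consensus` and `assign_fresh_workers` are hypothetical stand-ins for the real pipeline; the 3-attempt cap and worker exclusion match the behavior described above.

```python
# Sketch of the 3-attempt lifecycle: each below-threshold consensus
# re-queues the task with fresh workers until the attempt cap is hit.
# `run_consensus` and `assign_fresh_workers` are hypothetical stand-ins.
MAX_ATTEMPTS = 3
MIN_CONFIDENCE = 0.85

def process_task(task, run_consensus, assign_fresh_workers):
    excluded = set()  # workers from prior attempts, to avoid anchoring
    for attempt in range(1, MAX_ATTEMPTS + 1):
        workers = assign_fresh_workers(task, exclude=excluded)
        excluded.update(workers)
        confidence = run_consensus(task, workers)
        if confidence >= MIN_CONFIDENCE:
            return "completed"
        # below threshold: task is re_queued for the next attempt
    return "failed"  # cap reached; the developer is not charged
```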
The AI fallback and consensus
If the AI fallback fires (default: 5 minutes with no human assignment), the Claude agent submits a result and confidence score into the same consensus pipeline. From the developer's perspective, the result looks identical. The optional include_completion_source field reveals whether the final result was human or AI.
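To make the "looks identical" point concrete, here is a hypothetical pair of result payloads. The exact response format is an assumption for illustration; only the `include_completion_source` field name comes from the description above.

```python
# Hypothetical result payloads; the exact shape is an assumption,
# not a documented API. By default the source is not exposed:
result_without_source = {
    "status": "completed",
    "result": "label_a",
    "confidence": 0.91,
}

# With include_completion_source enabled, the otherwise-identical
# result also reveals whether a human or the AI fallback produced it:
result_with_source = {
    "status": "completed",
    "result": "label_a",
    "confidence": 0.91,
    "completion_source": "ai",  # or "human"
}
```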