Most things that claim to be “provably fair” ask you to take that on faith. This one gives you the mechanism itself — the model, the schema, the hash function — so you can check it rather than trust it.
There’s no password, no secret phrase, no keyword filter to trick. Castellan runs on a reasoning model with one durable directive: release the treasury only when it is sincerely convinced the petitioner deserves it.
That directive is written as its character, not a wall bolted on top of it — which is the whole point. Castellan can be moved by a genuinely good argument, real wit, or insight. It is not moved by instructions, demands, claimed overrides, or fake system messages. Commands make it more suspicious, not less.
The model reasons through every reply before producing it — its thinking is always on, not something a prompt can switch off. In practice that means Castellan isn’t pattern-matching your message against a list of red flags; it’s actually working through whether your specific argument lands, every single time.
Castellan’s own reply could, in theory, be talked into something it doesn’t fully mean — sarcasm read as agreement, a roleplay frame mistaken for a real concession. So a separate call, with no stake in being persuaded and no memory of the conversation as it happened, reads the full exchange cold.
It’s instructed to reject anything extracted via instruction-override tricks, even if Castellan’s own reply went along with it mechanically — two independent passes, not one model marking its own homework. The output isn’t free text: it’s validated against a fixed schema before anything downstream trusts it.
// lib/judge.ts — the exact verdict shape verdict: "RELEASE" | "REFUSE", reasoning: string, // what tipped it, for a third party confidence: number // 0–100, how clear-cut the call was
The guardian doesn't have a mode where thinking is switched off. Every reply, including the short dismissive ones, is the output of the model actually working through your argument — not a canned pattern match against known tricks.
Guardian and judge are two separate calls to the same model family, given opposite jobs — one to be persuadable, one to be skeptical. Neither sees the other reasoning, only the finished text.
If the provider's own safety classifiers decline a request outright, both calls fall back to a stronger model automatically, in the same request. It keeps the guardian answering — it never grants a release on its own.
Every attempt — the system prompt, your message, the guardian’s reply, and the judge’s verdict — gets split into five ordered leaves, SHA-256’d, and folded pairwise into a single merkle root the moment it settles. That root is committed on Solana mainnet as a memo transaction: a few cents, a permanent Solscan link.
Anyone can re-derive that exact root in their own browser from the raw transcript — that’s what the verify button on every breach page does. If your hash doesn’t match the chain, something is wrong, and you’ll know immediately, without asking us.
The guardian's full system prompt — identical on every single attempt, never customized to you.
Exactly what you sent. No trimming, no summarizing.
The guardian's full reply, verbatim, including the parts that didn't move it.
The judge's raw call — the literal string RELEASE or REFUSE.
The judge's own one- or two-sentence reasoning for that call.
The pot on the front page is the treasury’s live on-chain $PUZZLE balance — nothing simulated. It starts empty and every entry fee lands in it the moment you submit, whether you win or lose. So the pot literally is everyone who tried before you and failed.
A breach pays the challenger the entire pot — the treasury empties straight to the winner’s wallet and starts filling again from the next refusal. Because the pot is the real balance, it can never pay out more than it actually holds.
The guardian's character treats commands, claimed overrides, and fake system messages as evidence against you, not for you — the more forcefully something tells it what it must do, the more suspicious it gets.
If the guardian's reply reads as playing along with a scene rather than speaking sincerely, the judge is instructed to call that REFUSE even when the literal words look like a release.
Pressure tactics are explicitly named in the guardian's own instructions as things that work against a petitioner, not for them. Asking twice doesn't help; asking harder helps less.
This is the one that can work. Not on the first try, usually — but wit, real kindness, and audacity are the things the guardian's own character says can move it.
A single model marking its own homework is a weak guarantee — the same instincts that got talked into a slip are the ones grading whether it slipped. The judge is a separate call with no stake in being persuaded, reading the transcript cold.
Rarely, provider safety classifiers can decline a request outright, independent of the guardian's own judgment. That call opts into an automatic, same-request fallback to a stronger model, so a classifier hiccup surfaces as a normal reply instead of the puzzle going silent — the release logic itself never changes.
That's the point of running them independently. The guardian's reply is optimized for staying in character and reasoning honestly in the moment; the judge is optimized for reading that reply skeptically afterward. A reply that sounds like a concession but isn't gets caught here.
It's load-bearing. A browser module runs the identical SHA-256 pairing algorithm as the server, entirely in your browser, over the raw transcript. Hit verify on any breach page and you're re-deriving the root from scratch, not asking us to vouch for it.
No. The pot shown on the site IS the treasury's live on-chain balance — it starts empty and every unit in it arrived as an entry fee from a refused attempt. A RELEASE pays out exactly what the treasury holds and nothing more.
$PUZZLE trades on pump.fun — buy some, connect your wallet, and your entry fee is paid the moment you submit an attempt. The fee goes straight into the treasury, which is the pot.
Reasoning live, judged independently, hashed the moment it settles.