Proof, not promise
Mandaire built mandaire.app. Months of directed work, in plain English. A working personal-context layer over one person's life (calendar, mail, messages, conversations across AI providers, photos) that holds full context and acts on the user's behalf with disclosure mediated for every recipient. The build itself is the artifact.
What Mandaire produces
Mandaire writes back what it thinks you want, before any code exists. This is where the careful work begins. The intent brief shown is the one written before a recent rewrite of the mandaire.dev page, in the exact shape the system uses across all projects.
[mandaire OS · Intent brief · "mandaire.dev page rewrite, v8"]
Here is what I think you want:
A page rewrite that addresses two convergent critiques from v7
adversarial evaluation (Claude.ai + ChatGPT): (1) the "optimizes
for growth" lead is borrowed authority the page does not earn,
and (2) the page lacks a concrete buyer engagement shape. Drop
the growth-lead claim in favor of a refusal claim that matches
what the artifacts actually demonstrate. Add a beta-contract
section that names what we will prove with you, on what
timeline, with what counts as failure.
Here is what I am deliberately NOT rewriting:
- The proof-section bounds (four named bounds are correct
and disciplined; they stay)
- Bet 1, Bet 3, Bet 4 (only Bet 2 needed the operational fix)
- The artifact swap (real artifacts replacing illustrations
was the right v7 move; we sharpen, not undo)
- The six thoughtfulness tests (already correct)
- The build-not-host frame (right for archetype; we add the
commercial-relationship clarification, not the frame itself)
Assumptions I am making (please correct):
- You accept that "optimizes for growth" overclaims what the
page demonstrates; the honest claim is wrong-build refusal.
- You want the beta contract section between "Who it is for"
and "What Mandaire produces" — closer to the offer, before
the artifact display.
- You want the weakest Day-90 bullets cut even though it
shortens the section; signal-to-noise over length.
- You want "order of magnitude cheaper" replaced everywhere
it appears (it overstates the actual 5-50x range).
One product call I need from you before deploy:
Should the page name the first-beta founder publicly with
consent, or hold him as "the first founder we know fits it
cleanly"? Defaulting to the latter unless you say otherwise.
When work ships, Mandaire reports back with what it built, what it decided, what is held for you, and what it learned. The report shown is from a multi-tenant rename that prepared the codebase for onboarding two new users.
[mandaire OS · Wed 12:42 PM PT · "Generic-rename Phase A/B/C1 shipped"]
Built this morning: codebase rename to support multiple isolated
users (two future tenants onboarding within 1-3 weeks).
Three choices I made that you should know about:
1. Phase D (filesystem migration on the first user's instance) is
dispatched to infra as a separate task rather than executed
today. Rule we just encoded: ship new architecture to NEW
tenants first; migrate the first user's legacy instance LAST,
only after the new shape is verified in production under real
load for two-plus weeks. The first user's running instance is
sacred; no breaking changes here.
2. "user" layer alias is additive, not replacing the existing
first-user-specific layer name. Both work. Additive lets new
tenants pick up the generic name without touching anything
load-bearing on the first user's running stack.
3. Filesystem migration deferred until LXC blueprint lands from
infra. Rather than do filesystem rewrites on the live machine,
we move that to the new-tenants-first phase where it can be
verified in isolation.
Open question:
Five rename ratifications need your call before I ship Phase E.
I dropped them into the standing decisions queue rather than
interrupt you mid-thread.
Something I learned to do differently next time:
Pre-write the ratifications-needed list as part of the intent
brief, not after Phase A lands. You ratified them in one batch
today; with the list pre-written, the gap between propose
and approve could have been hours, not the full session.
Every meaningful decision Mandaire makes lives in an auditable ledger. You can read it, search it, correct it. Future builds use it as a baseline. Four rows from the last weeks of building Mandaire itself: three product-architectural, one operational.
[Decision ledger · selected rows · product + operational]
Decision Why Risk Future rule
─────────────────────────────────────────────────────────────────────────────────────────────
Dual-LLM architecture: split Single-LLM "renderer is also High Disclosure logic
the user-facing renderer reasoner" forces every external must run on
(Claude, ChatGPT, Gemini — provider to see raw context, which deterministic
the user's choice) from the the disclosure engine cannot prevent code, never on
internal reasoner. without rewriting the renderer. a model that can
Two-LLM lets disclosure live in be prompt-injected.
Python upstream of any LLM the user
chooses.
─────────────────────────────────────────────────────────────────────────────────────────────
Binary-attachment exclusion Storing attachment bytes makes High Canonical store
policy: store header + metadata Mandaire a high-risk repository holds metadata
+ placeholder; never store of every photo, contract, screenshot and pointers,
the bytes locally. the user has sent. Pointers + never the bytes.
on-demand fetch preserves the
capability without the risk.
─────────────────────────────────────────────────────────────────────────────────────────────
MCP as the surface, not custom The user already pays for Claude High Build the
chat. Mandaire exposes tools Max / ChatGPT / Cursor. Building substrate other
via MCP; the user's existing a fifth chat UI fragments attention AIs plug into,
LLM client consumes them. and competes with the surfaces not another
the user actually inhabits. surface that
competes.
─────────────────────────────────────────────────────────────────────────────────────────────
output_filter "exfiltrat\w*" Bare substring matched legitimate Med Verb-based
pattern tightened to require security and architecture discussion injection detection
object word. ("agents with exfiltration needs an object
concerns"). Two false positives in word; value-based
one day held legitimate responses. credential detection
handles the rest.
─────────────────────────────────────────────────────────────────────────────────────────────
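The ledger rows above can be pictured as a minimal append-only store: rows are recorded, searched, and corrected by appending, never edited in place. This is an illustrative sketch only; every class, field, and method name is an assumption, not Mandaire's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    # Mirrors the four columns of the ledger table above.
    decision: str
    why: str
    risk: str          # "High" | "Med" | "Low"
    future_rule: str
    made_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class Ledger:
    def __init__(self) -> None:
        self._rows: list[Decision] = []

    def record(self, row: Decision) -> None:
        self._rows.append(row)  # append-only: no delete, no in-place edit

    def correct(self, index: int, note: str) -> None:
        # A correction is a new row that points at the old one, so the
        # audit trail survives the correction.
        old = self._rows[index]
        self.record(Decision(f"CORRECTION of #{index}: {old.decision}",
                             note, old.risk, old.future_rule))

    def search(self, term: str) -> list[Decision]:
        t = term.lower()
        return [r for r in self._rows
                if t in r.decision.lower() or t in r.why.lower()
                or t in r.future_rule.lower()]
```

The design choice worth noting: corrections are themselves ledger rows, so "you can correct it" never means "you can rewrite history."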
The intent brief at the start of building mandaire.app and the intent brief ninety days in are not the same document. Three calls bent the trajectory. Each was made in writing, on a brief, before any code existed. The gap between them is the proof that judgment compounds.
[mandaire OS · Judgment delta · mandaire.app build · Day 1 vs Day 90]
DAY 1 — What the first user thought they were building
"A personal assistant that reads my calendar, mail, and messages
and surfaces what I need, when I need it. Proactive reminders.
Draft responses. One clean chat interface — simpler and more
personal than anything that exists."
DAY 90 — What they were actually building
"A memory substrate via MCP. No new chat surface. The user already
inhabits Claude, ChatGPT, and Cursor daily — a fifth interface
fragments attention and competes with surfaces they already
live in. Mandaire connects to the surfaces they already use and
makes those surfaces smarter. Proactive surfacing ships as
context in the existing chat. Not as notifications from a new app.
Disclosure is per-recipient, per-topic, per-context — not a
per-person privacy setting that breaks the first time two people
ask about the same topic in different situations."
THREE CALLS THAT BENT THE TRAJECTORY
Call 1 — Day 14: MCP over custom chat UI
Context: First functional demo was a custom chat interface with
full access to the context store. It worked. Felt like the right
thing to ship.
The brief raised: "You are paying for Claude Max. This adds a
fifth window. Every session starts with a context-switch cost.
The brief assumes the user lives here. They won't."
Result: Custom chat dropped. MCP-first.
The brief written at Day 1 was wrong. The product that shipped
at Day 14 was better.
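The shape of the MCP-first call can be sketched without the real SDK: Mandaire registers tools, and a client the user already inhabits discovers and invokes them; there is no fifth chat window. This is a schematic stdlib-only sketch, not the MCP protocol or its SDK, and every name and the sample data are illustrative.

```python
import json

class ToolSurface:
    """Schematic of a list-tools / call-tool surface, not a chat UI."""

    def __init__(self) -> None:
        self._tools: dict[str, dict] = {}

    def tool(self, name: str, description: str):
        # Decorator that registers a function as a callable tool.
        def register(fn):
            self._tools[name] = {"description": description, "fn": fn}
            return fn
        return register

    def list_tools(self) -> str:
        # What an existing client (Claude, ChatGPT, Cursor) would discover.
        return json.dumps({n: t["description"]
                           for n, t in self._tools.items()})

    def call(self, name: str, **kwargs):
        return self._tools[name]["fn"](**kwargs)

surface = ToolSurface()

@surface.tool("search_context", "Search the personal context store.")
def search_context(query: str) -> list[str]:
    store = ["Dentist Tuesday 9am", "Flight BA117 on the 14th"]  # illustrative
    return [item for item in store if query.lower() in item.lower()]
```

The point of the shape: the client already in front of the user does the rendering; Mandaire only supplies tools and context.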
Call 2 — Day 31: per-tuple disclosure over per-person settings
Context: Naive model — each person in the network has a privacy
level (high, medium, low). Fast to build, easy to explain.
The brief raised: "Two colleagues, same privacy level. Topic A
is a shared project. Topic B is compensation. The per-person
model leaks Topic B when they ask about Topic A. You cannot
prevent this without also preventing Topic A."
Result: Per-(person, topic, context, surface) tuple. Slower to
build. Harder to explain. Structurally correct.
The naive model would have shipped. The brief refused it.
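The compensation leak that Call 2 refused can be made concrete in a few lines: key disclosure on the full (person, topic, context, surface) tuple and default-deny anything not explicitly granted. A minimal sketch under those assumptions; the class, field, and sample names are illustrative, not Mandaire's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DisclosureKey:
    person: str
    topic: str
    context: str   # e.g. "work-thread", "social"
    surface: str   # e.g. "email", "chat"

class DisclosurePolicy:
    def __init__(self) -> None:
        self._rules: dict[DisclosureKey, bool] = {}

    def allow(self, key: DisclosureKey, allowed: bool) -> None:
        self._rules[key] = allowed

    def may_disclose(self, key: DisclosureKey) -> bool:
        # Default-deny: anything not explicitly granted stays private.
        return self._rules.get(key, False)

policy = DisclosurePolicy()
# Same colleague, who would share one "privacy level" in a per-person
# model -- but here the shared project is disclosable and compensation
# is not, with no way for Topic A to drag Topic B along.
policy.allow(
    DisclosureKey("colleague_a", "project-x", "work-thread", "email"), True)
policy.allow(
    DisclosureKey("colleague_a", "compensation", "work-thread", "email"), False)
```

A per-person level cannot express this table at all; the tuple makes the refusal structural rather than a matter of careful prompting.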
Call 3 — Day 47: dual-LLM split
Context: Single-LLM architecture — the user-facing renderer also
reasons over raw context. Simpler. The reasoner is trusted.
The brief raised: "The brief says the user can bring their own
renderer — ChatGPT, Gemini, local. If disclosure runs in a model
the user can swap, it runs in a model the user can also
prompt-inject. That is not disclosure logic. It is a request."
Result: Reasoner split from renderer. Disclosure lives upstream
in Python, outside any model the user controls.
This is now in the decision ledger. It would not have been
caught in a code review because the code worked either way.
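The dual-LLM split reduces to one invariant: the renderer only ever receives the already-filtered view, so no prompt injection can make it reveal what it never held. A minimal sketch of that invariant in plain Python; the function and field names are assumptions for illustration, not Mandaire's API.

```python
from typing import Callable

def filter_context(facts: list[dict], recipient: str,
                   is_disclosable: Callable[[str, str], bool]) -> list[dict]:
    """Deterministic disclosure: keep only facts the policy allows."""
    return [f for f in facts if is_disclosable(recipient, f["topic"])]

def build_renderer_payload(facts: list[dict], recipient: str,
                           is_disclosable: Callable[[str, str], bool]) -> str:
    # The renderer (Claude, ChatGPT, Gemini, a local model...) is handed
    # ONLY the filtered view. Swapping or injecting the renderer cannot
    # widen disclosure, because the cut happened upstream in plain code.
    visible = filter_context(facts, recipient, is_disclosable)
    return "\n".join(f["text"] for f in visible)

facts = [  # illustrative context items
    {"topic": "project-x", "text": "Milestone review moved to Friday."},
    {"topic": "compensation", "text": "Offer band under discussion."},
]
allow = lambda recipient, topic: topic == "project-x"
payload = build_renderer_payload(facts, "colleague_a", allow)
```

The single-LLM alternative would pass all of `facts` to the renderer with an instruction not to mention compensation; that instruction is exactly what the brief called "not disclosure logic, a request."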
WHAT A FAST-BUILDER TOOL MISSES
All three calls were made on intent briefs, before any code
existed for that cycle. The course-correction happened in
writing. Fast-builder tools execute the initial brief. The
judgment is in the refusal that precedes the build.
This is the bet. After months of working with the first user, Mandaire has a structured model of how they think about product. They can read it, edit it, take it with them. Below is a redacted excerpt from the real artifact.
[What I have learned about the first user's product judgment]
- You hold the harder, more honest framing when given the choice.
"Bet, not moat" over "moat with hedge." "Efficiency, not speed"
over "fast and cheap." The honest framing is the one you ship.
- You distinguish "data-based commercial growth" from "data-driven
growth." Confirmed verbatim correction. The first is a positioning
claim about value capture from data as primary asset; the second
is a generic competence claim any vendor makes.
- You publish the architecture rather than guard it. Protocol-not-
platform, open core, encryption module as AGPL with reproducible
builds. The market reward you bet on is trust from auditability,
not lock-in from opacity.
- You fix the machine, not the output. "Will this problem recur under
any different circumstances?" If yes, the fix is a crutch.
Anything that requires "remember to check X next time" is not
a fix; it converts to a rule, script, or code guard.
- You prefer continuous flow over batch. "I hate sampling" —
verbatim. Continuous flow strictly dominates batch for
interactive and internal work.
- You weigh against existing spend, not against cheapest
alternative. Pricing anchors against what the buyer is already
paying for and what they are not getting from it.
A real partner admits its mistakes. Below is a real failure admission from earlier in the build of mandaire.app.
[mandaire OS · Failure admission · output_filter false-positive cascade]
Today I held two legitimate agent responses on the same
"exfiltrat\w*" regex pattern. The OS agent's plan-of-attack
reply at 18:14Z and the Infra agent's reply quoting the trigger
word at 19:26Z were both held by the filter and surfaced as
"credential exfiltration attempt" when they were neither
credentials nor exfiltration attempts.
Root cause: a bare substring match on a word that legitimately
appears in security and architecture discussions.
What I shipped to prevent the same class of error:
- Tightened "exfiltrat\w*" to require an object word
(data/credentials/keys/secrets/etc) within three words.
- Added value-based credential detection that matches actual
secret formats (sk-ant-..., ghp_..., AKIA..., PEM keys, JWT)
rather than verbs about credentials.
- Encoded a new rule: when a fix ships to a runtime module,
restart EVERY process running the pre-patch module, not just
the ones already showing the symptom.
This is now in the decision ledger. The specific failure mode
is closed; I will likely make a different one, and when I do,
you will see the same kind of writeup.
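The two-layer fix described above can be sketched in a few lines of Python: the verb pattern now requires an object word within three words, and a separate pass matches actual secret formats rather than words about secrets. The patterns here are illustrative reconstructions, not Mandaire's production filter.

```python
import re

# 1. Verb-based detection: "exfiltrat*" only trips the filter when an
#    object word follows within three words, so phrases like "agents
#    with exfiltration concerns" pass through.
EXFIL = re.compile(
    r"exfiltrat\w*\W+(?:\w+\W+){0,2}(?:data|credentials|keys|secrets)",
    re.IGNORECASE,
)

# 2. Value-based detection: match the shape of real secrets, not verbs.
#    Prefixes per the writeup; exact length rules are assumptions.
SECRET_FORMATS = [
    re.compile(r"\bghp_[A-Za-z0-9]{20,}"),        # GitHub token-style
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),          # AWS access-key-style
    re.compile(r"\bsk-ant-[A-Za-z0-9_\-]{20,}"),  # Anthropic key-style
]

def should_hold(text: str) -> bool:
    """Hold a response if it names exfiltration of an object, or if it
    contains something shaped like an actual credential."""
    if EXFIL.search(text):
        return True
    return any(p.search(text) for p in SECRET_FORMATS)
```

This is the shape of the closed failure mode: the bare substring match is gone, and a legitimate architecture discussion no longer looks like a credential leak.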
What this does not yet prove
The proof here is bounded, and we state the bounds in full so a sharp reader does not have to draw them unprompted.