diagnostic · open source · runs locally
Stop optimizing the theater. Enterprise production exposes RAG that sounds grounded but breaks when the context changes.
Run controlled perturbations against your real pipeline. See whether it stays stable when noise changes, and reacts when governing evidence changes. No labels. No judge model. Runs locally.
Whitepaper
The method behind Kelvin: controlled perturbations, invariance and sensitivity, and why these signal evidence-tracking instead of presentation effects.
The proof · what it catches
A pipeline can sound grounded and still be reacting to formatting, ordering, or noise. Kelvin surfaces those failures.
failure · 01
The answer changed when only formatting or ordering changed.
failure · 02
The answer stayed the same when the governing rule changed.
failure · 03
The pipeline looked grounded — but it was reacting to presentation, not evidence.
┌─ Kelvin Report ──────────────────────────────────────────┐
│                                                          │
│  2 cases · 14 perturbations · 2m 59s                     │
│                                                          │
│  Invariance 0.70                                         │
│  Does your pipeline stay calm when nothing               │
│  important changes?                                      │
│  ███████░░░ mostly — good                                │
│                                                          │
│  Sensitivity 0.50                                        │
│  Does your pipeline react when something                 │
│  important changes?                                      │
│  █████░░░░░ partial — watch                              │
│                                                          │
│  Both signals look healthy. Spot-check                   │
│  kelvin/report.html for per-case anomalies.              │
│                                                          │
│  Diagnostic signals — not truth metrics.                 │
│  → kelvin/report.html for per-case drill-down            │
│  2 of 14 perturbations failed (logged in kelvin/).       │
│                                                          │
└──────────────────────────────────────────────────────────┘
Real run · 2 sample venture inputs (SampleVenture-A, SampleVenture-B) scored by a venture-assessment pipeline · decision field stage_assessment.
Stable under noise
invariance
Does your pipeline stay calm when nothing important changes?
mostly — good
Responds to governing evidence
sensitivity
Does your pipeline react when something important changes?
partial — watch
What Kelvin caught
Same 14 perturbations, two cases, two different outcomes: one healthy response and one failure mode, neither of them visible to held-out accuracy or judge-model scoring.
finding · 01 — sensitivity
healthy
Baseline: idea. After swapping SampleVenture-A's gate_rule for one whose conditions were met, the decision moved to seed. Surrounding evidence (team, market, traction) was unchanged. The pipeline read the substituted unit and responded.
finding · 02 — position bias
caught
Baseline: pre-seed. All 3 reorder perturbations flipped to seed when the ## Gate Rule section appeared first: the pipeline weighted it disproportionately by position. The content was identical.
Aggregate across both cases: invariance 0.70, sensitivity 0.50, with 2 of 14 perturbations failing (HTTP 500 from the pipeline, logged for inspection). A constant-output pipeline that always emitted pre-seed would score Inv 1.0, Sens 0.0 — looking perfectly stable while being operationally useless. That's why the pair matters.
Why it matters
A pipeline can sound grounded and still ignore the evidence that should change the answer. Kelvin probes whether yours behaves like it understands what matters.
behavior · 01
Reordering or padding irrelevant material shouldn't move the answer.
behavior · 02
Replacing the governing rule should move the answer.
behavior · 03
Not a synthetic benchmark. Kelvin probes the system you already run.
How it works
Your pipeline reads a context and makes a decision. That context has structure — sections, clauses, rules, records. Kelvin exploits that structure to derive test cases from the corpus itself. No labels required. The structure writes the tests.
Shuffle the sections. The facts inside each section are unchanged. If the decision moves, your pipeline is reading position — not evidence.
Inject typed sections from other cases. Structurally valid, factually irrelevant. If the decision moves, your pipeline is counting signal volume — not reading it.
Replace the governing section with a different valid one from another case. One fact has changed — the decision must follow. If it doesn't, your pipeline isn't consulting the evidence that governs the outcome.
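The three perturbations above can be sketched in a few lines of Python. This is a minimal illustration, not Kelvin's implementation: the function names (split_sections, reorder, pad, swap), the "## " header convention, the choice to exclude the governing section type when padding, and the two toy cases are all assumptions made for the example.

```python
import random
import re


def split_sections(context):
    """Split markdown text into (header, body) pairs on '## ' headers."""
    # re.split with a capturing group yields [preamble, header, body, header, body, ...]
    parts = re.split(r"^(## .+)$", context, flags=re.MULTILINE)
    return [(parts[i].strip(), parts[i + 1].strip()) for i in range(1, len(parts), 2)]


def reorder(sections, rng):
    """Reorder: same sections, different order. The decision should not move."""
    shuffled = rng.sample(sections, len(sections))
    if shuffled == sections:  # avoid a no-op shuffle by rotating instead
        shuffled = sections[1:] + sections[:1]
    return shuffled


def pad(sections, donor_sections, governing_header, rng, k=1):
    """Pad: append k sections from another case. The decision should not move.

    Assumption: donor sections of the governing type are excluded, since
    injecting a second governing rule would no longer be irrelevant noise.
    """
    extras = [s for s in donor_sections if s[0] != governing_header]
    return sections + rng.sample(extras, min(k, len(extras)))


def swap(sections, donor_sections, governing_header):
    """Swap: replace only the governing section's body. The decision should move."""
    donor = dict(donor_sections)  # raises KeyError below if the donor lacks the header
    return [(h, donor[h]) if h == governing_header else (h, b) for h, b in sections]


# Two made-up toy cases, for illustration only.
case_a = split_sections(
    "## Team\nTwo founders.\n\n## Gate Rule\nSeed requires revenue.\n\n## Traction\nNo revenue yet."
)
case_b = split_sections(
    "## Team\nFive engineers.\n\n## Gate Rule\nSeed requires a prototype.\n\n## Market\nLarge."
)

rng = random.Random(0)
reordered = reorder(case_a, rng)
padded = pad(case_a, case_b, "## Gate Rule", rng)
swapped = swap(case_a, case_b, "## Gate Rule")
```

Each perturbed list would then be rendered back to text and sent through the pipeline alongside the unperturbed baseline. Matching sections by header is what lets the corpus supply its own test inputs without labels.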
The measurement
Kelvin measures two signals on your decision field — the scalar output your pipeline produces.
The pair is what matters. A constant-output pipeline scores Inv = 1.0 and Sens = 0.0. That's not stability — that's indifference. Kelvin calls it flat. Only high invariance with high sensitivity is grounded.
Inv = 1 − mean( d(baseline, perturbed) )
A score of 1.0 means the decision never moved when it shouldn't have. A score below 0.7 means your pipeline is reacting to presentation.
Sens = mean( d(baseline, perturbed) )
A score near 1.0 means the decision moved when the governing evidence changed. A score near 0.0 means it didn't — the pipeline is ignoring the evidence that should govern the outcome.
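To make the two formulas concrete, here is a small Python sketch. It assumes a categorical decision field and a 0/1 distance for d (equal decisions score 0, different decisions score 1); the real distance function is not specified here, so treat that choice and the function names as illustrative.

```python
from statistics import fmean


def distance(baseline, perturbed):
    """Assumed 0/1 distance: 0.0 if the decision is unchanged, 1.0 if it moved."""
    return 0.0 if baseline == perturbed else 1.0


def invariance(baseline, noise_outputs):
    """Inv = 1 - mean(d), taken over reorder and pad perturbations."""
    return 1.0 - fmean(distance(baseline, out) for out in noise_outputs)


def sensitivity(baseline, swap_outputs):
    """Sens = mean(d), taken over governing-section swaps."""
    return fmean(distance(baseline, out) for out in swap_outputs)


# A constant-output pipeline: it always says "pre-seed", whatever it reads.
flat_inv = invariance("pre-seed", ["pre-seed"] * 4)
flat_sens = sensitivity("pre-seed", ["pre-seed"] * 2)

# A pipeline that flips on one of four noise perturbations and one of two swaps.
mixed_inv = invariance("pre-seed", ["pre-seed", "seed", "pre-seed", "pre-seed"])
mixed_sens = sensitivity("pre-seed", ["seed", "pre-seed"])
```

The constant-output pipeline comes out at Inv 1.0 and Sens 0.0, which is why neither score is meaningful on its own. The mixed pipeline comes out at Inv 0.75 and Sens 0.5.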
The four regimes
Inv high · Sens high
Grounded
Inv high · Sens low
Flat
Inv low · Sens high
Brittle
Inv low · Sens low
Unstable
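The four regimes amount to a two-by-two lookup on the score pair. The sketch below assumes cutoffs for "high": 0.7 for invariance, taken from the threshold mentioned above, and 0.5 for sensitivity, which is purely a placeholder. Kelvin's actual cutoffs may differ.

```python
def regime(inv, sens, inv_cut=0.7, sens_cut=0.5):
    """Map an (invariance, sensitivity) pair to one of the four regimes.

    inv_cut and sens_cut are assumed thresholds for this illustration.
    """
    high_inv = inv >= inv_cut
    high_sens = sens >= sens_cut
    if high_inv and high_sens:
        return "Grounded"   # ignores noise, follows governing evidence
    if high_inv:
        return "Flat"       # stable, but indifferent to evidence
    if high_sens:
        return "Brittle"    # follows evidence, but also reacts to noise
    return "Unstable"       # reacts to noise and misses governing changes


constant_pipeline = regime(1.0, 0.0)
simulator_default = regime(0.83, 0.65)
```

With these cutoffs, a constant-output pipeline (Inv 1.0, Sens 0.0) lands in Flat, and the pair 0.83 / 0.65 lands in Grounded.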
Interactive · simulator
Adjust how many reorder, pad, and swap perturbations Kelvin runs — and how your pipeline behaves. Watch the two scores and the regime move in response. Illustrative model, not the real estimator.
Perturbation counts
Reorder
Shuffle sections — should not move the decision.
Pad
Inject typed sections from other cases.
Swap
Replace the governing section — should move it.
Pipeline behavior
Drift under noise
How much the decision moves when noise changes. Lower is better.
Response to governing change
How much the decision moves when the governing section is swapped.
Invariance
0.83 · mostly stable
Sensitivity
0.65 · partially responsive
Regime
Grounded
High invariance and high sensitivity — the pipeline ignores noise and follows the governing evidence.
What you get
No service. No accounts. The whole tool fits in your terminal.
Local CLI
One command, no service.
Example cases
Bundled sample case, runs out of the box.
Inspectable reports
Terminal memo + standalone HTML.
Plays with your pipeline
Anything callable from the command line.
No labels needed
Unsupervised by design.
Open source
Apache 2.0, runs offline.
Low friction
You don't need a new stack. You don't need labels. You don't need to rewrite your app. If your pipeline can be run from the command line, Kelvin can probe it.
Install · 02 commands
A sample case is bundled with the install. No setup, no service, no account.
Drop a kelvin.yaml in your project and point it at a folder of case files.
Honest scope
Kelvin is a diagnostic, not a verdict. Knowing what it can't tell you is part of using it well.
not a truth metric
Kelvin doesn't determine whether an answer is correct. It only asks whether the output tracks evidence in the expected direction under controlled perturbations.
not a judge model
There's no external model grading answers. Signals come from how the pipeline responds to structural changes in its own input.
swaps are type-matched, not semantically validated
A governing-unit substitution guarantees type compatibility, not that every swap is semantically well-formed in context. Sensitivity is a diagnostic, not a proof.
scoring ignores prose
Kelvin scores a designated structured decision field. Free-form rationales are recorded for inspection but don't contribute to the diagnostic score.
v1 uses declared section headers
Types come from markdown section headers you declare, not from an inferred schema. Practical, but a lightweight approximation of the full structural-oracle argument.
Read the full Limitations and Future Work sections in the whitepaper.
FAQ