diagnostic · open source · runs locally
Is your AI understanding your data or just guessing confidently?
AI pipeline maturity, measured.
Kelvin gives you the score — on any pipeline that reads context and produces a decision: RAG, agents, classifiers, extractors.
~30 seconds · pure Python · runs locally · nothing to sign up for
The formula
What a good pipeline looks like
Three quantities. Three Kelvin probes. One regime.
Fastest path · one paste
Let your coding agent Kelvin for you
Paste into Claude Code, Cursor, or any agent that can edit your repo. It installs Kelvin and wires it to your pipeline. No existing code is touched — only one new kelvin.yaml. Delete it any time.
Install Kelvin (https://kelvin-eval.com) and wire it to the RAG pipeline in this repo.
1. pip install kelvin-eval
2. Before writing kelvin.yaml, ask me:
   - which command runs my pipeline (must accept an input file, write a JSON output file)
   - which JSON field is the decision, and its value space
   - which section-header types should govern the decision
3. Create one kelvin.yaml at the repo root. Show it to me before saving. Do not modify any other files.
4. Run `kelvin check`.
5. Open kelvin/report.html.
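The prompt asks for a command that accepts an input file and writes a JSON output file. A minimal sketch of such an entry point is below. Everything in it is hypothetical: the `--input` and `--output` flag names, the `decide` placeholder, and the choice of `stage_assessment` as the decision field (borrowed from the sample run on this page). It illustrates the shape of the contract, not Kelvin's own interface.

```python
import argparse
import json
from pathlib import Path


def decide(context: str) -> str:
    """Placeholder decision logic; a real pipeline would call its model here."""
    return "seed" if "conditions met" in context.lower() else "pre-seed"


def run(input_path: Path, output_path: Path) -> dict:
    """Read one case file and write a JSON result containing the decision field."""
    context = input_path.read_text(encoding="utf-8")
    result = {"stage_assessment": decide(context)}  # hypothetical field name
    output_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


def main(argv=None) -> int:
    """Parse the (assumed) command-line flags and run the pipeline once."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline wrapper")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args(argv)
    run(args.input, args.output)
    return 0
```

Calling `main()` under the usual `if __name__ == "__main__":` guard would turn this into a command that can be run once per case file.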
The cost of weak evaluation
Without a strong evaluation loop, getting a pipeline production-ready takes many iterations.
Kelvin shows whether the pipeline reacts to changes that matter — not to noise.
The proof · what it catches
What Kelvin tells you
A pipeline can sound grounded and still be reacting to formatting, ordering, or noise. Kelvin surfaces those failures.
failure · 01
The answer changed when only formatting or ordering changed.
failure · 02
The answer stayed the same when the governing rule changed.
failure · 03
The pipeline looked grounded — but it was reacting to presentation, not evidence.
┌─ Kelvin Report ──────────────────────────────────────────┐
│                                                          │
│  2 cases · 14 perturbations · 2m 59s                     │
│                                                          │
│  Invariance   0.70                                       │
│  Does your pipeline stay calm when nothing               │
│  important changes?                                      │
│  ███████░░░  mostly — good                               │
│                                                          │
│  Sensitivity  0.50                                       │
│  Does your pipeline react when something                 │
│  important changes?                                      │
│  █████░░░░░  partial — watch                             │
│                                                          │
│  Both signals look healthy. Spot-check                   │
│  kelvin/report.html for per-case anomalies.              │
│                                                          │
│  Diagnostic signals — not truth metrics.                 │
│  → kelvin/report.html for per-case drill-down            │
│  2 of 14 perturbations failed (logged in kelvin/).       │
│                                                          │
└──────────────────────────────────────────────────────────┘
Real run · 2 sample venture inputs (SampleVenture-A, SampleVenture-B) scored by a venture-assessment pipeline · decision field stage_assessment.
Stable under noise
invariance
Does your pipeline stay calm when nothing important changes?
mostly — good
Responds to governing evidence
sensitivity
Does your pipeline react when something important changes?
partial — watch
What Kelvin caught
Two findings from a single run
Same 14 perturbations, two cases, two different failure modes — each one invisible to held-out accuracy or judge-model scoring.
finding · 01 — sensitivity
healthy · SampleVenture-A followed the governing rule
Baseline: idea. After swapping SampleVenture-A's gate_rule for one whose conditions were met, the decision moved to seed. Surrounding evidence — team, market, traction — was unchanged. The pipeline read the substituted unit and responded.
finding · 02 — position bias
caught · SampleVenture-B reacted to where the rule appeared
Baseline: pre-seed. All 3 reorder perturbations flipped to seed when the ## Gate Rule section appeared first — the pipeline weighted it disproportionately by position. The content was identical.
Aggregate across both cases: invariance 0.70, sensitivity 0.50, with 2 of 14 perturbations failing (HTTP 500 from the pipeline, logged for inspection). A constant-output pipeline that always emitted pre-seed would score Inv 1.0, Sens 0.0 — looking perfectly stable while being operationally useless. That's why the pair matters.
Why it matters
Most AI evals reward plausible answers. Kelvin tracks paired invariance + sensitivity as a signature of understanding.
A pipeline can sound grounded and still ignore the evidence that should change the answer. Kelvin probes whether yours behaves like it understands what matters.
behavior · 01
Stable when noise changes
Reordering or padding irrelevant material shouldn't move the answer.
behavior · 02
Responsive when the signal changes
Replacing the governing rule should move the answer.
behavior · 03
Grounded in your actual pipeline
Not a synthetic benchmark. Kelvin probes the system you already run.
How it works
How Kelvin works
Your pipeline reads context and makes a decision — RAG, agents, classifiers, whatever you've built. That context has structure — sections, clauses, rules, records. Kelvin exploits that structure to derive test cases from the corpus itself. No labels required. The structure writes the tests.
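As a rough illustration of how structure can yield test cases, the sketch below splits a markdown case file on level-2 headers (such as `## Gate Rule`) and produces a reordered variant. The function names, the `## ` header convention, and the sample text in the usage note are assumptions made for this example; Kelvin's actual parsing may differ.

```python
import random
import re

# Assumption: sections are delimited by level-2 markdown headers ("## Name").
HEADER = re.compile(r"^## .+$", re.MULTILINE)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Split markdown into (header, body) pairs; text before the first header is dropped."""
    matches = list(HEADER.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        header = match.group(0)[3:].strip()
        body = text[match.end():end].strip()
        sections.append((header, body))
    return sections


def join_sections(sections: list[tuple[str, str]]) -> str:
    """Render (header, body) pairs back into markdown."""
    return "\n\n".join(f"## {header}\n{body}" for header, body in sections) + "\n"


def reorder(sections: list[tuple[str, str]], seed: int) -> list[tuple[str, str]]:
    """Return a shuffled copy: same content, different order."""
    shuffled = sections[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

For example, `join_sections(reorder(split_sections(case_text), seed=1))` gives a variant of `case_text` with identical content in a different order, which a grounded pipeline should answer the same way.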
The measurement
Two scores. One question.
Kelvin measures two signals on your decision field.
The pair is what matters. A constant-output pipeline scores Inv = 1.0, Sens = 0.0 — not stability, indifference. Kelvin calls it flat. Only high invariance with high sensitivity is grounded.
Invariance
reorder · pad_length · pad_content
Inv = 1 − mean( d(baseline, perturbed) )
1.0 means the decision never moved when it shouldn't have. Below 0.7 means the pipeline is reacting to presentation.
Sensitivity
swap
Sens = mean( d(baseline, perturbed) )
Near 1.0 means the decision moved when governing evidence changed. Near 0.0 means it didn't — the pipeline is ignoring evidence that should govern the outcome.
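The two formulas can be sketched in a few lines. The distance `d` is not specified on this page, so the version below assumes the simplest choice, 0 when the decision is unchanged and 1 when it differs. Treat that, and the function names, as assumptions rather than Kelvin's estimator.

```python
from statistics import mean


def d(baseline: str, perturbed: str) -> float:
    """Assumed distance: 0.0 if the decision is unchanged, 1.0 otherwise."""
    return 0.0 if baseline == perturbed else 1.0


def invariance(baseline: str, perturbed_decisions: list[str]) -> float:
    """Inv = 1 - mean(d) over perturbations that should NOT move the decision."""
    return 1.0 - mean(d(baseline, p) for p in perturbed_decisions)


def sensitivity(baseline: str, perturbed_decisions: list[str]) -> float:
    """Sens = mean(d) over perturbations that SHOULD move the decision."""
    return mean(d(baseline, p) for p in perturbed_decisions)


# A constant-output pipeline always answers "pre-seed", whatever the input.
constant_inv = invariance("pre-seed", ["pre-seed", "pre-seed", "pre-seed"])
constant_sens = sensitivity("pre-seed", ["pre-seed", "pre-seed"])
```

Under this distance the constant-output pipeline gets `constant_inv == 1.0` and `constant_sens == 0.0`, the flat signature described above.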
The maturity map
Inv high · Sens high
Mature
Inv high · Sens low
Flat
Inv low · Sens high
Brittle
Inv low · Sens low
Unstable
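The four quadrants can be written as a small lookup. The page does not say where "high" begins on the map, so the `threshold` default of 0.8 is an assumption borrowed from the FAQ's "both above 0.8" guidance, and the function name is hypothetical.

```python
def regime(inv: float, sens: float, threshold: float = 0.8) -> str:
    """Map an (invariance, sensitivity) pair to a maturity-map quadrant.

    The cutoff for "high" is an assumed parameter, not a documented value.
    """
    high_inv = inv > threshold
    high_sens = sens > threshold
    if high_inv and high_sens:
        return "Mature"
    if high_inv:
        return "Flat"
    if high_sens:
        return "Brittle"
    return "Unstable"
```

With these assumptions, `regime(1.0, 0.0)` returns `"Flat"`, which is where a constant-output pipeline lands.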
Interactive · simulator
Try it: dial the perturbations
Adjust how many reorder, pad, and swap perturbations Kelvin runs — and how your pipeline behaves. Watch the two scores and the regime move in response. Illustrative model, not the real estimator.
Perturbation counts
Reorder
Shuffle sections — should not move the decision.
Pad
Inject typed sections from other cases.
Swap
Replace the governing section — should move it.
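Pad and swap can be sketched on cases represented as lists of `(header, body)` pairs. The function names are hypothetical, and padding only with non-governing sections is an assumption made here so that the padding stays noise. This shows the idea of type-matched substitution, not Kelvin's implementation.

```python
Section = tuple[str, str]  # (header type, body text)


def pad(sections: list[Section], donor: list[Section], governing: set[str]) -> list[Section]:
    """Append the donor case's non-governing sections; the decision should not move."""
    extra = [s for s in donor if s[0] not in governing]
    return sections + extra


def swap(sections: list[Section], donor: list[Section], governing_type: str) -> list[Section]:
    """Replace the governing section with the donor's section of the same type."""
    replacement = next((s for s in donor if s[0] == governing_type), None)
    if replacement is None:
        raise ValueError(f"donor case has no section of type {governing_type!r}")
    return [replacement if header == governing_type else (header, body)
            for header, body in sections]
```

Matching on the header type guarantees only that the substituted section is of the right kind. Whether it makes sense in context is a separate question, which is why the scope notes further down treat sensitivity as a diagnostic.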
Pipeline behavior
Drift under noise
How much the decision moves when noise changes. Lower is better.
Response to governing change
How much the decision moves when the governing section is swapped.
Invariance
0.83 · mostly stable
Sensitivity
0.65 · partially responsive
Regime
Grounded
High invariance and high sensitivity — the pipeline ignores noise and follows the governing evidence.
What you get
No service. No accounts. The whole tool fits in your terminal.
Local CLI
One command, no service.
Example cases
Bundled sample case, runs out of the box.
Inspectable reports
Terminal memo + standalone HTML.
Plays with your pipeline
Anything callable from the command line.
No labels needed
Unsupervised by design.
Open source
Apache 2.0, runs offline.
Low friction
Measure the maturity of the pipeline you already have.
No new stack. No labels. No rewrites. If your pipeline runs from the command line, Kelvin can probe it.
Install · 02 commands
Run it locally in under a minute
A sample case is bundled with the install. No setup, no service, no account.
To probe your own pipeline, drop a kelvin.yaml in your project and point it at a folder of case files.
Honest scope
What Kelvin doesn't claim
Kelvin is a diagnostic, not a verdict. Knowing what it can't tell you is part of using it well.
not a truth metric
Kelvin doesn't determine whether an answer is correct. It asks whether the output tracks evidence in the expected direction under controlled perturbations.
not a judge model
No external model grades answers. Signals come from how the pipeline responds to structural changes in its input.
swaps are type-matched, not semantically validated
Governing-unit substitution guarantees type compatibility, not that every swap is semantically well-formed in context. Sensitivity is a diagnostic, not a proof.
scoring ignores prose
Kelvin scores a designated structured decision field. Free-form rationales are recorded for inspection but don't contribute to the score.
v1 uses declared section headers
Types come from markdown headers you declare, not from an inferred schema. Practical, but a lightweight approximation of the full structural-oracle argument.
Read the full Limitations and Future Work sections in the whitepaper.
FAQ
- Does Kelvin prove correctness?
  No. It diagnoses whether the pipeline appears to follow evidence rather than presentation.
- Is Kelvin using another AI judge model?
  No. It uses controlled perturbations, not an external model grading answers.
- Does Kelvin lock you into one framework?
  No. Anything callable from the command line works.
- What's a good result?
  Both above 0.8 is mature. 0.6–0.8 is workable but worth investigating. Below 0.6 on either signals a real problem. A constant-output pipeline scores 1.0 / 0.0: stability without understanding, which is why the pair matters.