diagnostic · open source · runs locally

Is your AI understanding your data or just guessing confidently?

AI pipeline maturity, measured.

Kelvin gives you the score — on any pipeline that reads context and produces a decision: RAG, agents, classifiers, extractors.

~30 seconds · pure Python · runs locally · nothing to sign up for

The formula

What a good pipeline looks like

good pipeline = low noise + low irrelevant sensitivity + high relevant sensitivity

Three quantities. Three Kelvin probes. One regime.

Fastest path · one paste

Let your coding agent Kelvin for you

Paste into Claude Code, Cursor, or any agent that can edit your repo. It installs Kelvin and wires it to your pipeline. No existing code is touched — only one new kelvin.yaml. Delete it any time.

Prompt · for your coding agent
Install Kelvin (https://kelvin-eval.com) and wire it to the RAG pipeline in this repo.

1. pip install kelvin-eval
2. Before writing kelvin.yaml, ask me:
   - which command runs my pipeline (must accept an input file, write a JSON output file)
   - which JSON field is the decision, and its value space
   - which section-header types should govern the decision
3. Create one kelvin.yaml at the repo root. Show it to me before saving. Do not modify any other files.
4. Run `kelvin check`.
5. Open kelvin/report.html.
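The prompt asks for a command that accepts an input file and writes a JSON output file. A minimal sketch of such a wrapper follows. Everything named here is an assumption for illustration: the script name `run_pipeline.py`, the argument order, the `decide` placeholder standing in for your real pipeline, and the decision field `stage_assessment` (borrowed from the sample run on this page). Kelvin's own requirements may differ.

```python
import json
import sys
from pathlib import Path

DECISION_FIELD = "stage_assessment"  # assumed name, borrowed from the sample run


def decide(context: str) -> str:
    """Hypothetical stand-in for your real pipeline call."""
    # Placeholder rule for illustration only; swap in your RAG/agent/classifier.
    return "seed" if "conditions met" in context.lower() else "idea"


def run(input_path: str, output_path: str) -> None:
    """Read one case file and write the decision as a JSON object."""
    context = Path(input_path).read_text(encoding="utf-8")
    result = {DECISION_FIELD: decide(context)}
    Path(output_path).write_text(json.dumps(result), encoding="utf-8")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run(sys.argv[1], sys.argv[2])
    else:
        print("usage: python run_pipeline.py INPUT_FILE OUTPUT_JSON", file=sys.stderr)
```

Keeping the decision in one named JSON field with a small value space (here `idea` or `seed`) is what lets a tool compare runs without parsing prose.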

The cost of weak evaluation

Without a strong evaluation loop, reaching production-ready takes many more iterations.

Kelvin shows whether the pipeline reacts to changes that matter — not to noise.

The proof · what it catches

What Kelvin tells you

A pipeline can sound grounded and still be reacting to formatting, ordering, or noise. Kelvin surfaces those failures.

  • failure · 01

    The answer changed when only formatting or ordering changed.

  • failure · 02

    The answer stayed the same when the governing rule changed.

  • failure · 03

    The pipeline looked grounded — but it was reacting to presentation, not evidence.

$ kelvin check
sample venture-scoring report · SampleVenture-A, SampleVenture-B
┌─ Kelvin Report ──────────────────────────────────────────┐
│                                                          │
│   2 cases · 14 perturbations · 2m 59s                    │
│                                                          │
│   Invariance    0.70                                     │
│   Does your pipeline stay calm when nothing              │
│   important changes?                                     │
│      mostly  good                                        │
│                                                          │
│   Sensitivity   0.50                                     │
│   Does your pipeline react when something                │
│   important changes?                                     │
│      partial  watch                                      │
│                                                          │
│   Both signals look healthy. Spot-check                  │
│   kelvin/report.html for per-case anomalies.             │
│                                                          │
│   Diagnostic signals — not truth metrics.                │
│   → kelvin/report.html for per-case drill-down           │
│   2 of 14 perturbations failed (logged in kelvin/).      │
│                                                          │
└──────────────────────────────────────────────────────────┘

Real run · 2 sample venture inputs (SampleVenture-A, SampleVenture-B) scored by a venture-assessment pipeline · decision field stage_assessment.

Stable under noise

invariance

0.70 / 1.00

Does your pipeline stay calm when nothing important changes?

mostly — good

Responds to governing evidence

sensitivity

0.50 / 1.00

Does your pipeline react when something important changes?

partial — watch

What Kelvin caught

Two findings from a single run

Same 14 perturbations, two cases, two different behaviors: one healthy, one failure. Neither is visible to held-out accuracy or judge-model scoring.

finding · 01 — sensitivity

healthy

SampleVenture-A followed the governing rule

Baseline: idea. After swapping SampleVenture-A's gate_rule for one whose conditions were met, the decision moved to seed. Surrounding evidence — team, market, traction — was unchanged. The pipeline read the substituted unit and responded.

Inv 0.857 · Sens 1.0 on swap

finding · 02 — position bias

caught

SampleVenture-B reacted to where the rule appeared

Baseline: pre-seed. All 3 reorder perturbations flipped to seed when the ## Gate Rule section appeared first — the pipeline weighted it disproportionately by position. The content was identical.

Inv 0.571 · Sens stable

Aggregate across both cases: invariance 0.70, sensitivity 0.50, with 2 of 14 perturbations failing (HTTP 500 from the pipeline, logged for inspection). A constant-output pipeline that always emitted pre-seed would score Inv 1.0, Sens 0.0 — looking perfectly stable while being operationally useless. That's why the pair matters.

Why it matters

Most AI evals reward plausible answers. Kelvin tracks paired invariance + sensitivity as a signature of understanding.

A pipeline can sound grounded and still ignore the evidence that should change the answer. Kelvin probes whether yours behaves like it understands what matters.

behavior · 01

Stable when noise changes

Reordering or padding irrelevant material shouldn't move the answer.

behavior · 02

Responsive when the signal changes

Replacing the governing rule should move the answer.

behavior · 03

Grounded in your actual pipeline

Not a synthetic benchmark. Kelvin probes the system you already run.

How it works

How Kelvin works

Your pipeline reads context and makes a decision — RAG, agents, classifiers, whatever you've built. That context has structure — sections, clauses, rules, records. Kelvin exploits that structure to derive test cases from the corpus itself. No labels required. The structure writes the tests.
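To make "the structure writes the tests" concrete, here is a minimal sketch of one way to split a case file into sections and derive a reorder perturbation from it. It assumes sections are marked by level-2 markdown headers (`## `), as in the `## Gate Rule` example on this page. The function names are hypothetical and this is not Kelvin's internal implementation.

```python
import random
import re

# Level-2 markdown headers such as "## Gate Rule" (assumed section marker).
HEADER = re.compile(r"^## +(.+?)[ \t]*$", re.MULTILINE)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Split a document into (header, body) pairs; text before the first header is dropped."""
    matches = list(HEADER.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.end():end].strip()))
    return sections


def render(sections: list[tuple[str, str]]) -> str:
    """Turn (header, body) pairs back into a markdown document."""
    return "\n\n".join(f"## {h}\n{b}" for h, b in sections) + "\n"


def reorder(sections: list[tuple[str, str]], seed: int) -> list[tuple[str, str]]:
    """Return a shuffled copy; content is unchanged, only section order moves."""
    shuffled = sections[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Feeding `render(reorder(split_sections(doc), seed))` to the pipeline gives an input whose evidence is identical to the baseline, so any change in the decision can be attributed to ordering alone.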

The measurement

Two scores. One question.

Kelvin measures two signals on your decision field.

The pair is what matters. A constant-output pipeline scores Inv = 1.0, Sens = 0.0 — not stability, indifference. Kelvin calls it flat. Only high invariance with high sensitivity is grounded.

Invariance

reorder · pad_length · pad_content
Inv = 1 − mean( d(baseline, perturbed) )

1.0 means the decision never moved when it shouldn't have. Below 0.7 means the pipeline is reacting to presentation.

Sensitivity

swap
Sens = mean( d(baseline, perturbed) )

Near 1.0 means the decision moved when governing evidence changed. Near 0.0 means it didn't — the pipeline is ignoring evidence that should govern the outcome.
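The two formulas above can be sketched in a few lines. The distance `d` is not defined on this page, so the sketch assumes the simplest choice for a categorical decision field: 0 if the decision is unchanged, 1 if it changed. The function names are illustrative, not Kelvin's API.

```python
from statistics import mean


def distance(baseline: str, perturbed: str) -> float:
    """Assumed 0/1 distance: 0.0 if the decision is unchanged, 1.0 otherwise."""
    return 0.0 if baseline == perturbed else 1.0


def invariance(baseline: str, perturbed_decisions: list[str]) -> float:
    """Inv = 1 - mean(d) over reorder and pad perturbations."""
    return 1.0 - mean(distance(baseline, p) for p in perturbed_decisions)


def sensitivity(baseline: str, perturbed_decisions: list[str]) -> float:
    """Sens = mean(d) over swap perturbations."""
    return mean(distance(baseline, p) for p in perturbed_decisions)


# A constant-output pipeline never moves, whatever the perturbation.
constant_inv = invariance("pre-seed", ["pre-seed"] * 4)
constant_sens = sensitivity("pre-seed", ["pre-seed"] * 2)
```

Under this assumed distance the constant-output pipeline gets `constant_inv == 1.0` and `constant_sens == 0.0`, which is the flat signature described above: both scores come from the same distance, so neither is informative without the other.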

The maturity map

Inv high · Sens high

Mature

Inv high · Sens low

Flat

Inv low · Sens high

Brittle

Inv low · Sens low

Unstable
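The four quadrants of the map can be written as a small lookup. The map itself does not say where "high" begins, so the 0.8 cutoff below is an assumption taken from the FAQ's "both above 0.8 is mature"; treat it as a tunable parameter rather than Kelvin's actual boundary.

```python
def regime(inv: float, sens: float, threshold: float = 0.8) -> str:
    """Map an (invariance, sensitivity) pair to a maturity-map quadrant.

    The threshold is an assumed cutoff for "high", not a documented constant.
    """
    high_inv = inv > threshold
    high_sens = sens > threshold
    if high_inv and high_sens:
        return "Mature"
    if high_inv:
        return "Flat"
    if high_sens:
        return "Brittle"
    return "Unstable"
```

For example, `regime(1.0, 0.0)` returns `"Flat"`, the constant-output case.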

Interactive · simulator

Try it: dial the perturbations

Adjust how many reorder, pad, and swap perturbations Kelvin runs — and how your pipeline behaves. Watch the two scores and the regime move in response. Illustrative model, not the real estimator.

Perturbation counts

Reorder

Shuffle sections — should not move the decision.

4

Pad

Inject typed sections from other cases.

4

Swap

Replace the governing section — should move it.

4

Pipeline behavior

Drift under noise

How much the decision moves when noise changes. Lower is better.

0.18

Response to governing change

How much the decision moves when the governing section is swapped.

0.65

Invariance

0.83

mostly stable

Sensitivity

0.65

partially responsive

Regime

Grounded

High invariance and high sensitivity — the pipeline ignores noise and follows the governing evidence.

What you get

No service. No accounts. The whole tool fits in your terminal.

  • Local CLI

    One command, no service.

  • Example cases

    Bundled sample case, runs out of the box.

  • Inspectable reports

    Terminal memo + standalone HTML.

  • Plays with your pipeline

    Anything callable from the command line.

  • No labels needed

    Unsupervised by design.

  • Open source

    Apache 2.0, runs offline.

Low friction

Measure the maturity of the pipeline you already have.

No new stack. No labels. No rewrites. If your pipeline runs from the command line, Kelvin can probe it.

Install · 02 commands

Run it locally in under a minute

A sample case is bundled with the install. No setup, no service, no account.

Drop a kelvin.yaml in your project and point it at a folder of case files. Sample case bundled.

Or skip install

Try it in your browser — no install required.

Open in Colab

Honest scope

What Kelvin doesn't claim

Kelvin is a diagnostic, not a verdict. Knowing what it can't tell you is part of using it well.

  • not a truth metric

    Kelvin doesn't determine whether an answer is correct. It asks whether the output tracks evidence in the expected direction under controlled perturbations.

  • not a judge model

    No external model grades answers. Signals come from how the pipeline responds to structural changes in its input.

  • swaps are type-matched, not semantically validated

    Governing-unit substitution guarantees type compatibility, not that every swap is semantically well-formed in context. Sensitivity is a diagnostic, not a proof.

  • scoring ignores prose

    Kelvin scores a designated structured decision field. Free-form rationales are recorded for inspection but don't contribute to the score.

  • v1 uses declared section headers

    Types come from markdown headers you declare, not from an inferred schema. Practical, but a lightweight approximation of the full structural-oracle argument.
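The type-matched swap described in the list above can be sketched as follows. Cases are represented as lists of (header, body) pairs, and the header text is treated as the section type. That representation and the function name are assumptions for illustration. The sketch only guarantees that the replacement has the same header type, which is exactly the limitation noted above.

```python
Section = tuple[str, str]  # (header, body); the header doubles as the section type


def swap_governing(case: list[Section], donor: list[Section], governing_type: str) -> list[Section]:
    """Replace the governing section of `case` with the same-typed section from `donor`."""
    donor_bodies = [body for header, body in donor if header == governing_type]
    if not donor_bodies:
        raise ValueError(f"donor has no section of type {governing_type!r}")
    if not any(header == governing_type for header, _ in case):
        raise ValueError(f"case has no section of type {governing_type!r}")
    # Only the governing section's body changes; order and other sections are kept.
    return [
        (header, donor_bodies[0]) if header == governing_type else (header, body)
        for header, body in case
    ]
```

Nothing here checks that the donor rule makes sense next to the case's other sections, so a swap that leaves the decision unchanged still needs a human look before it is read as a failure.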

Read the full Limitations and Future Work sections in the whitepaper ↗.

FAQ

Does Kelvin prove correctness?
No. It diagnoses whether the pipeline appears to follow evidence rather than presentation.
Is Kelvin using another AI judge model?
No. It uses controlled perturbations, not an external model grading answers.
Does Kelvin lock you into one framework?
No. Anything callable from the command line works.
What's a good result?
Both above 0.8 is mature. 0.6–0.8 is workable but worth investigating. Below 0.6 on either signals a real problem. A constant-output pipeline scores 1.0 / 0.0 — stability without understanding, which is why the pair matters.
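The bands in that last answer can be read as a rule on the weaker of the two scores. A minimal sketch, assuming scores of exactly 0.6 or 0.8 fall in the middle band (the answer does not say which side the boundaries belong to). The function name and return strings are illustrative.

```python
def interpret(inv: float, sens: float) -> str:
    """Apply the FAQ's rule-of-thumb bands to an (invariance, sensitivity) pair."""
    weakest = min(inv, sens)  # "below 0.6 on either" means the lower score decides
    if weakest > 0.8:
        return "mature"
    if weakest >= 0.6:
        return "workable, worth investigating"
    return "real problem"
```

A constant-output pipeline, `interpret(1.0, 0.0)`, lands in `"real problem"` despite its perfect invariance, because the rule looks at the pair and not at either score alone.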