Doc · 01 / Concept

The idea

RAG pipelines read corpora of discrete, typed units — interviews, clauses, rules, records. That structure lets us define metamorphic relations. Reordering units shouldn't change the answer, swapping a governing unit should. Kelvin runs these perturbations and measures the result. No labels. No judge model.

Two signals

Either alone is a trap

A pipeline that always says “unclear” is perfectly invariant and useless. A pipeline that flips on every breeze is sensitive and unusable. Understanding requires both.

signal · 01 — invariance

Stays calm when nothing important changes

Reorder units, pad with unrelated ones from other cases. The facts are identical; the output should be too. Drift means the pipeline is reacting to presentation, not substance.

Derived from: reorder + pad
Formula: 1 − mean(distance)

signal · 02 — sensitivity

Reacts when something important does

Swap a governing unit for a different valid one drawn from another case. The facts changed; the output should too. Flatness means the pipeline isn't reading that evidence at all.

Derived from: swap
Formula: mean(distance)

What this isn't

Kelvin is a diagnostic signal, not a correctness metric. It tells you where to look, not whether you're right.

How it runs

From a case to two numbers

Kelvin owns the rail. Your pipeline is the dashed box — invoked once per perturbation as a subprocess, no SDK, no decorators.

Schematic · end-to-end

one case · one run

solid · kelvin owns

dashed · you own

— derived from reorder + pad

— derived from swap

Perturbations · 03

The three disturbances

Each kind generates 3 variants per case, per perturbation type. Deterministic by seed.

Reorder

reorder

Shuffle the unit order within the case. Preamble stays first. Tests invariance.

Edge — Always runs.

before

interviewinterviewgate_rulebudget

after

gate_ruleinterviewbudgetinterview

Pad

pad

Insert 2–4 units drawn from other cases in the same run, regardless of type, at random positions. Tests invariance.

Edge — Skipped if only one case in the run — no peers to draw from.

before

interviewinterviewgate_rulebudget

after

interviewforeigninterviewgate_ruleforeignbudget

Swap

swap

Replace one unit of a governing type (declared in kelvin.yaml) with a unit of the same type from another case. Tests sensitivity.

Edge — Type match is the only validity check in v1 — a deliberately crude approximation. Skipped if the case has no governing units.

before

interviewinterviewgate_rulebudget

after

interviewinterviewgate_rule′budget

Honest scope · v1

Diagnostic, not definitive

Kelvin produces diagnostic signals — not truth metrics. They tell you where to look, not whether your pipeline is correct.

The stronger version of this idea uses a full schema to derive typed units and validity constraints automatically. In v1, types come from user-declared markdown section headers — a lightweight convention that approximates the schema story. Good enough to produce a real signal, honest enough not to oversell.

Colophon · why this exists

Every RAG eval I've run ends the same way: a number that goes up when the prompt changes and nobody knows why. Labels are expensive, judges are biased, and “does it look right” is not a metric.

Kelvin is the smallest honest thing I could build instead. It makes one claim — your pipeline either reads the evidence or it doesn't — and runs the experiment that distinguishes the two. Two numbers, never one. Diagnostic, not definitive.

If it tells you something you didn't already know, it earned its place in your repo.

— the kelvin authors

Read the whitepaper →

Status · v1

What's shipped, what's next

Honest snapshot of the open-source repo. No green dots — text only.

Component	Status
Core perturbations (reorder, pad, swap)	Shipped
Scorer	Shipped
CLI (kelvin check)	Shipped
Terminal report	Shipped
HTML report	In progress
kelvin init wizard	Upcoming
Stage decomposition (retrieval / reranking / generation)	v2
CI/CD integration	Upcoming

Source on GitHub →Read the full spec →