Doc · 01 / Concept
RAG pipelines read corpora of discrete, typed units: interviews, clauses, rules, records. That structure lets us define metamorphic relations. Reordering units shouldn't change the answer; swapping a governing unit should. Kelvin runs these perturbations and measures the result. No labels. No judge model.
Two signals
A pipeline that always says “unclear” is perfectly invariant and useless. A pipeline that flips on every breeze is sensitive and unusable. Understanding requires both.
signal · 01 — invariance
Reorder units, pad with unrelated ones from other cases. The facts are identical; the output should be too. Drift means the pipeline is reacting to presentation, not substance.
signal · 02 — sensitivity
Swap a governing unit for a different valid one drawn from another case. The facts changed; the output should too. Flatness means the pipeline isn't reading that evidence at all.
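A minimal sketch of how the two signals could be reduced to numbers. The text above does not say how Kelvin's scorer compares outputs, so the exact string match and the function names `invariance_score` and `sensitivity_score` are assumptions made for illustration.

```python
from typing import Sequence


def invariance_score(baseline: str, outputs: Sequence[str]) -> float:
    """Fraction of reorder/pad outputs that match the baseline answer.

    Assumption: outputs are compared by exact string equality.
    """
    if not outputs:
        raise ValueError("need at least one perturbed output")
    return sum(o == baseline for o in outputs) / len(outputs)


def sensitivity_score(baseline: str, outputs: Sequence[str]) -> float:
    """Fraction of swap outputs that differ from the baseline answer."""
    if not outputs:
        raise ValueError("need at least one perturbed output")
    return sum(o != baseline for o in outputs) / len(outputs)


# The degenerate pipeline that always answers "unclear":
always_unclear = ["unclear", "unclear", "unclear"]
inv = invariance_score("unclear", always_unclear)   # perfectly invariant
sens = sensitivity_score("unclear", always_unclear)  # never reacts to a swap
```

Read together, the pair exposes the failure that either number alone would hide: full invariance with zero sensitivity.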
What this isn't
Kelvin is a diagnostic signal, not a correctness metric. It tells you where to look, not whether you're right.
How it runs
Kelvin owns the rail. Your pipeline is the dashed box — invoked once per perturbation as a subprocess, no SDK, no decorators.
[Schematic · end-to-end, one case, one run. Solid: Kelvin owns. Dashed: you own. Invariance is derived from reorder + pad; sensitivity is derived from swap.]
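A rough sketch of the "one subprocess per perturbation" loop. The contract between Kelvin and the pipeline is not described here, so the one below (units as JSON on stdin, answer on stdout) is hypothetical, as are `PIPELINE_CMD` and `run_pipeline`. The stand-in pipeline just counts the units it receives.

```python
import json
import subprocess
import sys
from typing import List

# Hypothetical stand-in for "your pipeline": reads a JSON list of units
# from stdin and prints how many it received.
PIPELINE_CMD = [
    sys.executable,
    "-c",
    "import sys, json; units = json.load(sys.stdin); print(len(units))",
]


def run_pipeline(cmd: List[str], units: List[str], timeout: float = 60.0) -> str:
    """Invoke the pipeline once, as a separate process, for one variant."""
    proc = subprocess.run(
        cmd,
        input=json.dumps(units),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return proc.stdout.strip()


# One invocation per perturbed variant of the case.
variants = [["a", "b"], ["b", "a"], ["a", "b", "c"]]
outputs = [run_pipeline(PIPELINE_CMD, v) for v in variants]
```

Because the boundary is a process, the pipeline can be written in any language and needs no Kelvin import.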
Perturbations · 03
Each perturbation type generates 3 variants per case. Deterministic by seed.
Reorder. Shuffle the unit order within the case. The preamble stays first. Tests invariance.
Edge: always runs.
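A small sketch of a seeded reorder, assuming a case is a list of unit strings with the preamble at index 0. The function name `reorder_variants` and the way the seed is derived per variant are illustrative choices, not Kelvin's actual code.

```python
import random
from typing import List


def reorder_variants(units: List[str], case_seed: str, n_variants: int = 3) -> List[List[str]]:
    """Return n_variants shuffles of the case, keeping the preamble first.

    Assumption: units[0] is the preamble; everything after it may move.
    """
    head, body = units[:1], units[1:]
    variants = []
    for i in range(n_variants):
        # One RNG per variant, seeded from a string, so reruns reproduce
        # exactly the same orderings.
        rng = random.Random(f"{case_seed}:reorder:{i}")
        shuffled = body[:]
        rng.shuffle(shuffled)
        variants.append(head + shuffled)
    return variants


case = ["preamble", "interview A", "interview B", "clause 7"]
variants = reorder_variants(case, case_seed="case-1")
```

A shuffle of a short case can land on the original order; this sketch does not filter those out.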
Pad. Insert 2–4 units drawn from other cases in the same run, regardless of type, at random positions. Tests invariance.
Edge: skipped if the run has only one case, since there are no peers to draw from.
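A sketch of padding under the same list-of-strings assumption. `pad_variant` is a hypothetical name. Whether the real implementation protects the preamble during padding is not stated, so this version inserts at any position.

```python
import random
from typing import List, Optional


def pad_variant(units: List[str], peer_units: List[str], seed: str) -> Optional[List[str]]:
    """Insert 2-4 units taken from other cases at random positions.

    Returns None when there are no peers to draw from (the skip case).
    """
    if not peer_units:
        return None
    rng = random.Random(seed)
    # Cap at the number of available peers so sample() cannot fail.
    k = min(rng.randint(2, 4), len(peer_units))
    extras = rng.sample(peer_units, k)
    padded = list(units)
    for extra in extras:
        # Any index from 0 to len(padded) is a valid insertion point.
        padded.insert(rng.randint(0, len(padded)), extra)
    return padded


case = ["preamble", "interview A", "clause 7"]
peers = ["rule X", "record 9", "interview Q", "clause 2", "rule Y"]
padded = pad_variant(case, peers, seed="case-1:pad:0")
```

The original units keep their relative order; only unrelated material is woven in.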
Swap. Replace one unit of a governing type (declared in kelvin.yaml) with a unit of the same type from another case. Tests sensitivity.
Edge: type match is the only validity check in v1, a deliberately crude approximation. Skipped if the case has no governing units.
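A sketch of the swap with type match as the only validity check. Units are modeled as `(type, text)` tuples, and `swap_variant` is a hypothetical name. Two behaviors here are assumptions the text does not cover: a peer identical to the unit it would replace is excluded, and the swap is skipped when no same-type peer exists.

```python
import random
from typing import List, Optional, Set, Tuple

Unit = Tuple[str, str]  # (type, text): an assumed representation


def swap_variant(
    units: List[Unit],
    peer_units: List[Unit],
    governing_types: Set[str],
    seed: str,
) -> Optional[List[Unit]]:
    """Replace one governing unit with a same-type unit from another case."""
    rng = random.Random(seed)
    targets = [i for i, (t, _) in enumerate(units) if t in governing_types]
    if not targets:
        return None  # no governing units in this case: skip
    rng.shuffle(targets)
    for i in targets:
        unit_type = units[i][0]
        # Type match is the only validity check.
        candidates = [u for u in peer_units if u[0] == unit_type and u != units[i]]
        if candidates:
            swapped = list(units)
            swapped[i] = rng.choice(candidates)
            return swapped
    return None  # assumption: skip when no same-type peer is available


case = [("facts", "Tenant paid late."), ("rule", "Late fee is 5%.")]
peers = [("rule", "Late fee is waived."), ("facts", "Landlord sent notice.")]
swapped = swap_variant(case, peers, governing_types={"rule"}, seed="case-1:swap:0")
```

Only the governing unit changes, so any difference in the pipeline's answer can be attributed to it.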
Honest scope · v1
Kelvin produces diagnostic signals — not truth metrics. They tell you where to look, not whether your pipeline is correct.
The stronger version of this idea uses a full schema to derive typed units and validity constraints automatically. In v1, types come from user-declared markdown section headers — a lightweight convention that approximates the schema story. Good enough to produce a real signal, honest enough not to oversell.
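To make the header convention concrete, here is a minimal sketch that splits a markdown case file into typed units. The function name `split_units`, the lowercasing of header text, and the treatment of text before the first header as a `preamble` unit are all assumptions for illustration.

```python
import re
from typing import List, Tuple

# ATX-style markdown header: one to six '#' characters, then the title.
HEADER = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def split_units(markdown: str) -> List[Tuple[str, str]]:
    """Split a case file into (type, text) units keyed by section header."""
    units: List[Tuple[str, str]] = []
    current_type = "preamble"  # assumed label for text before any header
    buf: List[str] = []

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            units.append((current_type, text))

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            current_type = match.group(1).strip().lower()
            buf = []
        else:
            buf.append(line)
    flush()
    return units


doc = "Case 12\n\n## Interview\nTenant says X.\n\n## Clause\nRent due on the 1st.\n"
units = split_units(doc)
```

The unit type is whatever the author wrote in the header, which is why type match is only as reliable as the headers themselves.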
Colophon · why this exists
Every RAG eval I've run ends the same way: a number that goes up when the prompt changes and nobody knows why. Labels are expensive, judges are biased, and “does it look right” is not a metric.
Kelvin is the smallest honest thing I could build instead. It makes one claim — your pipeline either reads the evidence or it doesn't — and runs the experiment that distinguishes the two. Two numbers, never one. Diagnostic, not definitive.
If it tells you something you didn't already know, it earned its place in your repo.
— the kelvin authors
Read the whitepaper →
Status · v1
Honest snapshot of the open-source repo. No green dots — text only.
| Component | Status |
|---|---|
| Core perturbations (reorder, pad, swap) | Shipped |
| Scorer | Shipped |
| CLI (kelvin check) | Shipped |
| Terminal report | Shipped |
| HTML report | In progress |
| kelvin init wizard | Upcoming |
| Stage decomposition (retrieval / reranking / generation) | v2 |
| CI/CD integration | Upcoming |