Scoring
Scoring is a diagnostic signal, not a metric. The spec phrasing is deliberate: indicator, diagnostic — never accuracy or correctness.
Distance function
Operates on the declared decision field only
Other fields (prose, narrative) are shown in the report for context but don't affect scores.
| Field type | distance(a, b) |
|---|---|
| enum / string | 0 if a == b else 1 |
| numeric | min(1, abs(a − b) / max(abs(a), abs(b), 1)) |
| arrays / nested | unsupported in v1 — kelvin init fails with a clear error |
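The table above can be sketched as a single Python function. This is a minimal illustration, not the tool's actual implementation: the function name `distance` follows the table header, the `TypeError` is a stand-in for the init-time failure described for arrays and nested values, and treating booleans as enum-like values is an assumption.

```python
from numbers import Real


def _is_numeric(value):
    # bool is a subclass of int in Python; treating it as enum-like is an assumption.
    return isinstance(value, Real) and not isinstance(value, bool)


def distance(a, b):
    """Return a distance in [0, 1] between two decision-field values."""
    containers = (list, tuple, dict, set)
    if isinstance(a, containers) or isinstance(b, containers):
        # Stand-in for the v1 behaviour: arrays / nested values are unsupported.
        raise TypeError("arrays / nested fields are unsupported in v1")
    if _is_numeric(a) and _is_numeric(b):
        # Relative difference, capped at 1. The trailing 1 in max() keeps the
        # denominator from shrinking below 1 for small values.
        return min(1.0, abs(a - b) / max(abs(a), abs(b), 1))
    # enum / string: exact match or nothing.
    return 0.0 if a == b else 1.0
```

For instance, `distance(100, 90)` is a relative difference of 10 over 100, while `distance(-5, 5)` hits the cap because the raw ratio (10 over 5) exceeds 1.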
Aggregation
Two formulas. Always together.
signal · 01 — invariance
Invariance = 1 − mean(distance(baseline, perturbed))
Over reorder + pad perturbations.
signal · 02 — sensitivity
Sensitivity = mean(distance(baseline, perturbed))
Over swap perturbations.
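As a worked sketch of the two formulas, assuming the per-perturbation distances have already been computed, the aggregation might look like the following. The function names and the sample distance values are hypothetical and exist only for illustration.

```python
from statistics import mean


def invariance(distances):
    """1 − mean distance, over reorder + pad perturbations."""
    return 1 - mean(distances)


def sensitivity(distances):
    """Mean distance, over swap perturbations."""
    return mean(distances)


# Hypothetical distances between baseline and perturbed decisions.
reorder_and_pad = [0.0, 0.0, 0.1, 0.0]  # the decision barely moved
swap = [1.0, 1.0, 0.0]                  # the decision changed in 2 of 3 swaps

signals = {
    "invariance": invariance(reorder_and_pad),
    "sensitivity": sensitivity(swap),
}
```

Here invariance comes out near 0.975 and sensitivity near 0.667. Note that `statistics.mean` raises on an empty list; how the tool handles a case with no perturbations of a given kind is not stated above.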
Scoring is a pluggable function: the runner is structured so that a better scorer can replace the default without any change to the runner itself.
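One way to picture the pluggable design, under the assumption that a scorer is simply a callable from two decision values to a distance: the aggregation code accepts the scorer as a parameter, so swapping it requires no change elsewhere. The names `Scorer`, `exact_match`, `case_insensitive`, and `mean_distance` are hypothetical and not part of the tool.

```python
from typing import Any, Callable, Iterable, Tuple

# Assumed shape of a scorer: (baseline, perturbed) -> distance in [0, 1].
Scorer = Callable[[Any, Any], float]


def exact_match(a: Any, b: Any) -> float:
    """Default-style scorer: 0 if equal, else 1."""
    return 0.0 if a == b else 1.0


def case_insensitive(a: Any, b: Any) -> float:
    """Hypothetical replacement scorer that ignores letter case."""
    return 0.0 if str(a).lower() == str(b).lower() else 1.0


def mean_distance(
    pairs: Iterable[Tuple[Any, Any]], scorer: Scorer = exact_match
) -> float:
    """Average the scorer's output over (baseline, perturbed) pairs."""
    distances = [scorer(baseline, perturbed) for baseline, perturbed in pairs]
    return sum(distances) / len(distances)


pairs = [("Approve", "approve"), ("deny", "deny")]
default_score = mean_distance(pairs)                   # exact match
relaxed_score = mean_distance(pairs, case_insensitive)  # swapped-in scorer
```

With the default scorer the first pair counts as a mismatch; with the swapped-in scorer it does not, and `mean_distance` itself is untouched.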
Framing
Diagnostic, not definitive
Signals point; humans decide. A high invariance and a high sensitivity don't prove your pipeline is correct — they say it's reading substance, not presentation, and reacting to the right kind of change.
Execution
Serial. No caching. Local.
v1 stays simple. The user controls cost; the working directory holds everything.
Serial
One perturbation at a time.
v1 does not parallelize. The user controls cost with --only <case> during iteration. Parallel execution is a v2 concern.
No caching
Every run re-invokes the pipeline.
Each perturbation triggers a fresh pipeline invocation, with no reuse of earlier outputs. v2 may add content-hash caching.
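The serial, cache-free loop described above can be sketched roughly as follows. This is an illustration under loud assumptions: the function name `run_serial` is hypothetical, and the idea that the pipeline reads a case on stdin and prints JSON on stdout is an assumption, since the invocation contract is not specified here. `ECHO_PIPELINE` is a stand-in command used only to make the sketch runnable.

```python
import json
import subprocess
import sys


def run_serial(command, inputs, decision_field):
    """Invoke the pipeline once per input, in order, reusing nothing."""
    decisions = []
    for text in inputs:
        # Assumption: case text goes in on stdin, JSON comes back on stdout.
        proc = subprocess.run(
            command, input=text, capture_output=True, text=True, check=True
        )
        decisions.append(json.loads(proc.stdout)[decision_field])
    return decisions


# Stand-in pipeline: echoes its stdin back as the decision field.
ECHO_PIPELINE = [
    sys.executable,
    "-c",
    "import sys, json; print(json.dumps({'decision': sys.stdin.read().strip()}))",
]

decisions = run_serial(ECHO_PIPELINE, ["approve", "deny", "approve"], "decision")
```

The first and third inputs are identical, yet the command runs three times. That is the cost the "no caching" choice accepts in exchange for simplicity.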
Working directory
kelvin check runs from CWD.
./kelvin/ is written there. kelvin.yaml is read from there.
Exit codes
Three outcomes, one process
kelvin check returns a process exit code so CI can fail loudly when configuration breaks or the baseline cannot be scored.
| Code | Meaning |
|---|---|
| 0 | Success. Report written. Signals may still be concerning. |
| 1 | Config or cases-dir problem. kelvin.yaml invalid or unreadable. |
| 2 | Decision field missing or every baseline invocation failed. |
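A CI step could wrap the command and translate the code into a readable message, along these lines. The helper names `describe_exit` and `run_and_report` are hypothetical; only the three codes and their meanings come from the table above.

```python
import subprocess
import sys

# Meanings copied from the exit-code table.
EXIT_MEANINGS = {
    0: "Success: report written (signals may still be concerning).",
    1: "Config or cases-dir problem: kelvin.yaml invalid or unreadable.",
    2: "Decision field missing, or every baseline invocation failed.",
}


def describe_exit(code):
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}.")


def run_and_report(command):
    """Run the command, print the meaning of its exit code, return the code."""
    code = subprocess.run(command).returncode
    print(describe_exit(code), file=sys.stderr)
    return code


# In a CI script, assuming kelvin is on PATH:
#     sys.exit(run_and_report(["kelvin", "check"]))
```

Because code 0 can accompany concerning signals, a gate on the signal values themselves would need to read the report separately.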
Roadmap · v2
Stage decomposition
v1 measures the pipeline end-to-end. v2 will decompose the score across retrieval, reranking, and generation so instability can be localised — not only detected. If your sensitivity is low, you'll know whether the wrong evidence was retrieved or the right evidence was ignored.
Scope
What's in v1, what isn't
In · v1
- 01 Markdown-with-headers case format (one file per case).
- 02 Three perturbation kinds: reorder, pad, swap.
- 03 Shell-command pipeline invocation.
- 04 Serial execution, no caching.
- 05 Terminal report + self-contained HTML report.
- 06 Structured-field scoring (enum, string, numeric).
Out · v2 and beyond
- 01 Stage decomposition (retrieval vs. generation vs. reranking).
- 02 Semantic-equivalence scoring via LLM judge.
- 03 Parallelism and caching.
- 04 Perturbation packs for specific verticals.
- 05 Framework-native adapters (LangChain, LlamaIndex).
- 06 Dashboards, history, alerts, and anything else continuous or hosted.
- 07 Schema-inferred unit types and validity constraints.