
Scoring

Scoring is a diagnostic signal, not a metric. The spec phrasing is deliberate: indicator, diagnostic — never accuracy or correctness.

Distance function

Operates on the declared decision field only

Other fields (prose, narrative) are shown in the report for context but don't affect scores.

Field type        distance(a, b)
enum / string     0 if a == b else 1
numeric           min(1, |a − b| / max(|a|, |b|, 1))
arrays / nested   unsupported in v1; kelvin init fails with a clear error
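
The table above can be read as a single function. The following is a minimal sketch, not the actual kelvin scorer: the name `distance` comes from the table, but the type checks, the treatment of booleans as enum-like values, and raising `TypeError` for mixed or unsupported types are assumptions made for illustration.

```python
from numbers import Real


def distance(a, b):
    """Distance between two values of the declared decision field, in [0, 1]."""
    # Assumption: booleans are compared like enums, because bool is a
    # subclass of int in Python and would otherwise hit the numeric rule.
    if isinstance(a, bool) or isinstance(b, bool):
        return 0.0 if a == b else 1.0
    # Numeric fields: relative difference, capped at 1.
    # The trailing 1 in max() keeps the denominator from collapsing near zero.
    if isinstance(a, Real) and isinstance(b, Real):
        return min(1.0, abs(a - b) / max(abs(a), abs(b), 1))
    # Enum / string fields: exact match or full distance.
    if isinstance(a, str) and isinstance(b, str):
        return 0.0 if a == b else 1.0
    # Arrays / nested values are unsupported in v1. Rejecting mixed types
    # here as well is an assumption of this sketch.
    raise TypeError(
        f"unsupported decision field types: {type(a).__name__}, {type(b).__name__}"
    )


print(distance("approve", "deny"))  # 1.0
print(distance(0, 0.5))             # 0.5
print(distance(-5, 5))              # 1.0 (raw ratio is 2, capped at 1)
```

For numeric fields the result is a relative difference: `distance(10, 12)` is 2 / 12, roughly 0.167, while values of opposite sign saturate at 1.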

Aggregation

Two formulas. Always together.

signal · 01 — invariance

Invariance = 1 − mean(distance(baseline, perturbed))

Over reorder + pad perturbations.

signal · 02 — sensitivity

Sensitivity = mean(distance(baseline, perturbed))

Over swap perturbations.

Scoring is a pluggable function: the runner is structured so that a better scorer can replace the default without changes to the runner.
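
The two formulas, and the idea of a replaceable scorer, can be sketched as follows. This is an illustration under stated assumptions: the function names `invariance`, `sensitivity`, and `case_insensitive`, and the `dist` parameter, are hypothetical, and the stand-in `field_distance` implements only the enum / string rule. How kelvin averages across multiple cases is not specified here, so the sketch covers a single baseline.

```python
from statistics import mean


def field_distance(a, b):
    # Stand-in for the default scorer: enum / string rule only.
    return 0.0 if a == b else 1.0


def invariance(baseline, perturbed_outputs, dist=field_distance):
    """1 - mean distance, over outputs from reorder + pad perturbations."""
    return 1.0 - mean(dist(baseline, p) for p in perturbed_outputs)


def sensitivity(baseline, perturbed_outputs, dist=field_distance):
    """Mean distance, over outputs from swap perturbations."""
    return mean(dist(baseline, p) for p in perturbed_outputs)


def case_insensitive(a, b):
    # Hypothetical alternative scorer, passed in without changing the callers.
    return 0.0 if a.lower() == b.lower() else 1.0


reorder_and_pad = ["approve", "approve", "deny", "approve"]
swap = ["deny", "deny", "approve"]

print(invariance("approve", reorder_and_pad))  # 0.75
print(sensitivity("approve", swap))            # about 0.667
print(invariance("Approve", ["approve", "APPROVE"], dist=case_insensitive))  # 1.0
```

In this example one of four reorder/pad runs changed the decision, so invariance is 0.75; two of three swap runs changed it, so sensitivity is about 0.667.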

Framing

Diagnostic, not definitive

Signals point; humans decide. A high invariance and a high sensitivity don't prove your pipeline is correct — they say it's reading substance, not presentation, and reacting to the right kind of change.

Execution

Serial. No caching. Local.

v1 stays simple. The user controls cost; the working directory holds everything.

Serial

One perturbation at a time.

v1 does not parallelize. The user controls cost with --only <case> during iteration. Parallel execution is a v2 concern.

No caching

Every run re-invokes the pipeline.

The pipeline is invoked again for every perturbation; nothing is reused between runs. v2 may add content-hash caching. v1 stays simple.
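
Serial, uncached execution amounts to a plain loop. The sketch below is illustrative only: `run_serial` and `invoke` are hypothetical names, and `invoke` stands in for whatever runs the pipeline's shell command on one perturbed input.

```python
def run_serial(perturbations, invoke):
    """Invoke the pipeline once per perturbation, in order, with no cache."""
    results = []
    for perturbation in perturbations:
        # Always re-invoke, even if an identical input was seen before.
        results.append((perturbation, invoke(perturbation)))
    return results


calls = []


def fake_pipeline(text):
    # Hypothetical stand-in for a real pipeline invocation.
    calls.append(text)
    return text.upper()


outputs = run_serial(["a", "b", "a"], fake_pipeline)
print(len(calls))  # 3: the repeated input "a" is invoked twice
```

Because nothing is cached, cost scales linearly with the number of perturbations, which is why narrowing a run with --only <case> matters during iteration.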

Working directory

kelvin check runs from CWD.

./kelvin/ is written there. kelvin.yaml is read from there.
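
As a small illustration of the layout, both paths are resolved relative to the directory the command is run from. The helper name `resolve_paths` is hypothetical; only `kelvin.yaml` and `./kelvin/` come from the text above.

```python
from pathlib import Path


def resolve_paths(cwd=None):
    """Return (config file, output directory) relative to the working directory."""
    root = Path(cwd) if cwd is not None else Path.cwd()
    return root / "kelvin.yaml", root / "kelvin"


config_path, output_dir = resolve_paths()
print(config_path.name, output_dir.name)  # kelvin.yaml kelvin
```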

Exit codes

Three outcomes, one process

kelvin check returns a process exit code so CI can fail loudly when the configuration is broken or the run cannot produce a usable baseline.

Code   Meaning
0      Success. Report written. Signals may still be concerning.
1      Config or cases-dir problem. kelvin.yaml invalid or unreadable.
2      Decision field missing or every baseline invocation failed.
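
A CI step can branch on these codes. The wrapper below is a sketch, assuming `kelvin` is on the PATH; the names `EXIT_MEANINGS`, `describe_exit`, and `run_check` are hypothetical, and the meanings are copied from the table above.

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "Success. Report written. Signals may still be concerning.",
    1: "Config or cases-dir problem. kelvin.yaml invalid or unreadable.",
    2: "Decision field missing or every baseline invocation failed.",
}


def describe_exit(code):
    """Map an exit code to the meaning documented for kelvin check."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}.")


def run_check(argv=("kelvin", "check")):
    """Run the command, report the meaning on stderr, and return the exit code."""
    result = subprocess.run(list(argv))
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode
```

A CI script would end with `sys.exit(run_check())` so the job fails on codes 1 and 2. Exit code 0 only means the report was written, so the signals themselves still need a human look.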

Roadmap · v2

Stage decomposition

v1 measures the pipeline end-to-end. v2 will decompose the score across retrieval, reranking, and generation so instability can be localised — not only detected. If your sensitivity is low, you'll know whether the wrong evidence was retrieved or the right evidence was ignored.

Scope

What's in v1, what isn't

In · v1

  • 01 Markdown-with-headers case format (one file per case).
  • 02 Three perturbation kinds: reorder, pad, swap.
  • 03 Shell-command pipeline invocation.
  • 04 Serial execution, no caching.
  • 05 Terminal report + self-contained HTML report.
  • 06 Structured-field scoring (enum, string, numeric).

Out · v2 and beyond

  • 01 Stage decomposition (retrieval vs. generation vs. reranking).
  • 02 Semantic-equivalence scoring via LLM judge.
  • 03 Parallelism and caching.
  • 04 Perturbation packs for specific verticals.
  • 05 Framework-native adapters (LangChain, LlamaIndex).
  • 06 Dashboards, history, alerts: anything continuous or hosted.
  • 07 Schema-inferred unit types and validity constraints.