Checks#

A check asserts one condition on an item and reports a violation when the item fails it. The check engine turns a collection’s configuration into a verdict on each item: it resolves the checks that apply, runs them, and collects their violations. This page explains the model the engine is built on, how check libraries supply and run checks, and why the pieces are shaped the way they are. For the per-type catalog see the check types reference; for the end-to-end data flow of one check invocation see the domain model.

Terms#

Term	Meaning
Check	Shorthand for a check instance when context is unambiguous. A check asserts one condition and reports a violation when the condition fails.
Check type	The reusable definition of a constraint: `object_required_field`, `markdown_single_h1`, and so on. A check type is selected by its `kind:` id and appears in the generated check types reference.
Check instance	One configured check attached to a collection or filesystem scope: a check type plus its arguments, written as one YAML object under `checks:`. The type is the rule; the instance is the rule applied here.
Family	The kind of source data a check type reads: `structuredObject` (frontmatter), `markdownBodyText` (the body), `fileSystem` (names and paths), or `plainText` (raw body text).
Attachment target	The config site where a check instance is valid: `collection` for collection `checks:`, or `filesystem` for base-level `filesystemChecks`.
Check library	The provider that supplies and runs a check type. Native libraries wrap hand-written checks; schema-backed libraries delegate to an external validation engine.
Runtime granularity	The level where a check runs. Most checks are file-scoped; a few are file-set-scoped and reason across every selected file.
Severity	The consequence of a violation. `error` fails the run; `warning` is advisory and does not change the exit code.
Violation	One failed check result, with a message, source location, JSON pointer when applicable, severity, and sometimes a sibling file for collection-scoped findings.

Family, library, configuration site, and runtime granularity are separate axes. Family answers what data does this check read? Library answers who runs it? Attachment target answers where can a user configure it? Runtime granularity answers does it run once per file or once per selected file set? A single family can span libraries: structuredObject includes both object from the JSON Schema library and object_required_field from the native structured-object library.

The registry is the single source of truth for check types. Each check type self-registers a Descriptor (its id, family, configurableIn, docs metadata) and a constructor. cmd/engine builds the runnable list by registry lookup; the docs generator and katalyst check-types list read the same descriptors. A parity test fails if a configured kind has no descriptor, so a check type cannot ship undocumented.

Configuration Sites#

Collection-attached checks live under a collection’s checks: list. They run after selector resolution and can use schema precedence, variants, and the collection’s full sibling set. This is the historical model and remains the right place for rules that depend on collection identity.

Filesystem-attached checks live under a filesystem base’s filesystemChecks list. Each scope selects raw files with include and exclude globs. A no-selector katalyst check runs filesystem scopes before collection checks, so a project can enforce path policy before any collection exists. Explicit collection selectors stay collection-only.

The descriptor’s ConfigurableIn list controls where a check type may be configured. During migration, an empty list means collection. File-system-family checks that only read paths can usually support both configuration sites. Checks that read frontmatter or body text declare that they need a parsed document so filesystem scopes parse lazily and path-only scopes avoid unnecessary document work.

Check libraries#

A CheckLibrary is the provider behind a check type. It is the seam that lets Katalyst delegate work to engines it does not implement itself.

Libraries come in two kinds:

Native libraries wrap hand-written checks: filesystem, plaintext, markdownbodytext, structuredobject. Their logic is Go code in the repository.
Schema-backed libraries delegate to an external engine that compiles a named schema and runs items against it. JSON Schema (internal/checks/jsonschema) is the first and currently only one; a prose linter such as Vale is the next candidate.

A schema-backed library does three things: it compiles a schema from source bytes, runs an item’s data against the compiled schema, and maps the engine’s findings back into Katalyst violations. The JSON Schema library wraps santhosh-tekuri/jsonschema/v6 behind this contract, which keeps the rest of the codebase off the underlying library, lets the input normalize from YAML-native types into the JSON shape the validator expects, and flattens the library’s nested error tree into a flat violation list with source lines.

Schema is the Katalyst concept, not the JSON Schema document. A schema defines a collection’s shape; JSON Schema is one way to express it, a Vale style config is another. Schemas are stored flat under .katalyst/schemas/ and named by filename stem. The library that compiles a given schema is resolved at the binding site, from the referencing check type’s kind: kind: object resolves its schema through the JSON Schema library. There is no per-library directory and no migration when a second library arrives.

Availability is a hard error. A library reports whether it can run. An in-process library is always available; an out-of-process tool probes for its binary and an acceptable version. The engine checks availability before running a library’s checks, and a missing or too-old tool fails the run rather than silently skipping enforcement, so a misconfigured CI fails loudly.

Libraries are in-process or out-of-process. JSON Schema runs in the same process. An out-of-process library shells out to a separate binary, which adds the availability concern above and a performance one: launching a process per item is wasteful, so a batched run over a whole collection (mapping findings back to files by name) is the planned optimization, tracked in #68. The first cut runs per item, the simplest correct path.

Running a check#

Per item, the engine resolves which checks apply, then runs them.

Resolution starts from the collection’s configured checks and adds the checks of the first variant whose when predicates the item’s metadata satisfies. The object schema is selected by a precedence the JSON Schema library owns (a forced --schema, then an inline schema: directive, then the collection’s object checks); see the domain model for the precedence table and the full per-item lifecycle. Before any schema compiles, the engine confirms the owning libraries are available.

Running is uniform: every file check returns a list of violations, which the engine concatenates; file-set checks run in a second pass over the selected files. For collections, that file set is the whole collection. For filesystem checks, it is the scope’s selected file set plus its unmatched files. A violation carries a JSON-pointer Path, a Message, a source Line, an optional File (for file-set findings that name a sibling), and a Severity. An item with no violations prints path: OK.

Design rationale#

Self-registration over a central dispatch. A check type owns its descriptor and constructor in one file. The registry is assembled from those init() calls, wired in by a single blank-import aggregator. The cost is that the import must be present for the catalog to populate; the benefit is that adding a check type never edits a shared switch, and the parity test keeps the registry and the config grammar in lockstep.

Family and library are separate axes on purpose. Earlier the object check was filed only by its family and its provider was implicit. Splitting provenance (library) from source-data kind (family) lets a second engine supply a check type into an existing family without disturbing the native checks there: a prose linter joins markdownBodyText beside the native markdown checks. Family stays the documentation and routing axis; library stays the execution axis.

A thin wrapper over each external engine. The codebase depends on the Schema interface, not on santhosh-tekuri/jsonschema. That keeps the engine swappable, isolates input normalization and error flattening to one place, and gives every library the same shape, so the engine compiles, caches, and gates them identically regardless of whether the work happens in-process or in a subprocess.

File-set checks re-scan the whole collection when attached to a collection. A uniqueness or required-index verdict is only correct against every item, so these checks run a second pass over the full collection even under a single-item selector. The trade is that katalyst check notes/one.md does more work than its name suggests; the alternative (a per-item approximation) would report wrong answers.

Trade-offs and alternatives#

One registry, not two. Check types and their libraries share a single registry; every check type names its owning library through its descriptor. A separate library registry was rejected: it would split “which check types exist” from “who owns them” and force the engine to consult both.

Flat schemas, not per-library directories. Namespacing schemas under .katalyst/schemas/<library>/ looked tidier but would migrate every existing schema path for no functional gain, since the binding’s kind already names the library. Schemas stay flat.

Per-item invocation first, batching later. Running an out-of-process tool once per item is slow at scale but simple and correct. Batching a collection into one invocation is the optimization, deferred to #68 rather than built before a real out-of-process library exists.

Invariants#

The registry is authoritative. Every runnable check type has a descriptor, and generated docs read the same registry as the engine.
Family and library stay separate. Family describes the data a check reads; library describes the provider that runs it.
Collection-attached file-set checks see the whole collection. A selector may narrow output, but a collection-level verdict still needs the full sibling set.