Skip to content

Draft specification (v0 strawman)

Nothing here is locked

This is a strawman. It records the calls the prior art supports, the calls it leaves open, and the failure modes any real version has to answer. It is the seed of an eventual specification, not the specification. Syntax in particular is deliberately not final.

What markstay is

A source-level identity primitive for Markdown blocks. It assigns a stable identity to a logical block of content so that other tools can point at that block and keep pointing at it across edits.

It is not an annotation system. Annotation, transclusion, AI-assisted editing, and cross-references are consumers that build on the identity layer, not part of it.

The core model: identity, then evidence

Field Role Changes when content changes?
id stable logical identity no
hash drift detection yes
quote + prefix/suffix recovery evidence to re-find a detached marker n/a
  • id answers which block.
  • hash answers did it change since last seen.
  • quote (a W3C TextQuoteSelector-style snippet) answers where did it go when the marker is lost.

A hash mismatch with the marker present means "same block, changed content", not "new block". The id is identity; the hash and quote are never identity, only evidence.

Settled calls

These have clear prior-art support and a default recommendation.

Primary syntax: trailing HTML comment

The paragraph being identified.
<!-- msid:8f24 hash=sha256:7a9c -->
  • Invisible in GitHub-rendered Markdown; preserved in the raw .md source.
  • Needs no Markdown attribute extension; degrades to harmless source text on tools that do not understand it.
  • Docusaurus already uses end-of-block HTML comments for explicit ids, so there is precedent for exactly this shape.
  • Reads to an LLM as source metadata rather than prose to rewrite.

MDX profile: comment-expression form

HTML comments are invalid in MDX v2. Where the target is MDX, the same marker takes the JSX comment form:

The paragraph being identified.
{/* msid:8f24 hash=sha256:7a9c */}

.md uses the HTML-comment form; MDX uses this profile. One data model, two serialisations.

Marker placement: after the block

The marker follows the block it identifies. Precedent: Obsidian, kramdown, markdown-it-attrs, and Docusaurus all attach identity after content. It also suits left-to-right generation by agents. Attachment is defined over the CommonMark block tree, not blank-line heuristics, so the boundary around lists, nested blockquotes, and tables is unambiguous.

IDs: short generated by default, human-readable allowed

  • Default: a short opaque generated id, not derived from the block text, so it survives arbitrary edits to that text.
  • Allowed: human-readable ids for authored landmarks (msid:install-step).
  • Duplicate ids within a document are invalid; tools detect and offer repair.
  • UUIDs are permitted but never required (too token-heavy for dense coverage).

Block granularity (v1)

Identity is defined over CommonMark block nodes. v1 calls:

  • Lists: identify the list item, not the whole list.
  • Code fences: a marker after the closing fence identifies the whole fence.
  • Tables: a marker after the table identifies the whole table; row-level identity is deferred to a later extension.
  • Blockquotes: a marker after the quote identifies the whole quote.

Copy vs move

  • Move preserves the id.
  • Copy mints a new id by default (two logical blocks must not share identity).
  • Matches Notion / Logseq / Roam copy semantics.

Detached or stale markers are surfaced, not silently reattached

Borrowed from GitHub review comments: when a marker cannot be confidently mapped to its block, the tool marks it outdated rather than guessing a nearby block. Silent reattachment to the wrong block is worse than an explicit stale state.

Open questions

The survey does not settle these. They need decisions before a v1 spec.

  1. Attribute names and grammar. The msid: namespace with a positional id (msid:8f24) plus hash / quote keys: exact delimiters, reserved keys, quoted-value handling, and the extension namespace are still open.
  2. Hash normalisation. What exactly is hashed: line endings, trailing whitespace, list-relative indentation, code-fence content, marker exclusion. Without a precise rule, two implementations disagree on drift.
  3. Block-boundary attachment rules. The precise CommonMark-tree rule for which block a trailing marker binds to, especially nested lists and quotes.
  4. Quote/selector format. How much context (prefix/suffix length), how to handle repeated identical text, when a quote counts as a match.
  5. LLM preservation. Resolved by measurement, see the findings. The marker survives, but only under an explicit preservation contract. The conclusion is folded into the failure-mode mitigations below.

Failure modes any real version must answer

  • Marker detachment: an edit splits, merges, or moves a block and the marker orphans or binds to the wrong block. Mitigation: hash-drift check + quote recovery + explicit stale state.
  • Sanitiser stripping: a pipeline step removes HTML comments and ids vanish with no error. Mitigation: the MDX/attribute profiles, and consumers that detect a missing expected marker.
  • AI regeneration churn: an agent rewrites a whole document and drops or reassigns markers, breaking every reference. This is the primary failure mode for the target use case, and measurement confirms it is severe: a naive rewrite or "clean up" pass strips nearly every marker. The mitigations are therefore mandatory, not optional:
    1. inject an explicit preservation instruction into any agent that edits a markstay document, which restores survival to 100% in the evaluation;
    2. run a post-edit linter that detects dropped, duplicated, or relocated ids so silent loss becomes a caught error;
    3. use generated, non-semantic ids that give a model nothing to "improve".
  • Copy-paste duplication: a copied block carries its id, producing duplicates. Mitigation: copy mints a new id; tools detect and repair duplicates.
  • Granularity disagreement: implementations disagree on what a marker identifies. Mitigation: pin granularity over the CommonMark block tree.
  • Scope creep into an annotation product. Mitigation: keep the core to identity + resolution; annotation is a separate, layered spec.

Non-goals (v1)

  • Annotation, comment storage, threads.
  • Transclusion / embedding.
  • Row-level table identity, inline-span identity.
  • A backend, accounts, or a hosted registry.

Closest existing standards

  • Recovery anchoring: W3C Web Annotation Data Model selectors (TextQuoteSelector, TextPositionSelector).
  • Markdown syntax lineage: Pandoc / PHP Markdown Extra / kramdown {#id} attribute lists.
  • Markdown product precedent: Obsidian block references (^block-id).
  • Block-database precedent: Notion comments parented by block id; Logseq / Roam block references.
  • Revision / stale-state precedent: GitHub review comments (commit + path + line, with an outdated state).

See prior art for the full survey and source links.