Findings: dogfooding markstay on real docs¶
The marker-survival study asks whether LLMs keep a stay: token at all. This study
asks the next question: when an agent updates real documentation, does a stay actually
help preserve the block it names?
The public run uses two open corpora:
- FastAPI docs, 12 sampled pages.
- The Rust Book, 12 sampled chapters.
Each document was seeded with stays, then sent through realistic update tasks:
append, status, and consolidate. Each cell ran across Claude Haiku and Sonnet,
with and without the section 11 preservation instruction. After the model returned a
full rewritten document, the markstay linter compared before and after. Every
dropped id was judged as either a real content drop or a marker strip where the
content survived but the marker did not.
The question
On real documentation updates, how often does an LLM drop a tracked block, is the post-edit catch a useful guard or mostly friction, and how much does the preservation instruction change the outcome?
Public corpus results¶
Across FastAPI and the Rust Book together: 288 cells, 17,424 seeded stays.
| corpus | cells | stays | content drops | marker strips | blocked | guard/nag split |
|---|---|---|---|---|---|---|
| FastAPI | 144 | 11,784 | 220 | 1110 | 77 | 58% / 42% |
| Rust Book | 144 | 5640 | 89 | 414 | 36 | 53% / 47% |
The shape is clear: routine appends are safe, full regeneration is risky.
| task | cells | stays | content drops | marker strips | blocked | guard | nag |
|---|---|---|---|---|---|---|---|
| append | 96 | 5808 | 0 | 0 | 0 | 0 | 0 |
| status | 96 | 5808 | 19 | 89 | 30 | 11 | 19 |
| consolidate | 96 | 5808 | 290 | 1435 | 83 | 53 | 30 |
append never dropped a section and never stripped a marker. Rust Book status was
also clean; FastAPI status was harder, with 19 content drops. The dominant risk is
still the aggressive consolidate task, which asks the model to rewrite and merge the
document as a whole.
The instruction is the main control¶
The preservation instruction tells the editing agent to preserve every stay: marker
with its block. It is the mechanism that keeps ids stable; the linter catch is a
backstop after the fact.
| corpus | naive drops | instructed drops | drop reduction | naive strips | instructed strips | strip reduction |
|---|---|---|---|---|---|---|
| FastAPI | 173 | 47 | 73% | 870 | 240 | 72% |
| Rust Book | 82 | 7 | 91% | 366 | 48 | 87% |
This does not mean the instruction makes full-document regeneration harmless. It means that, across public docs and two models, the instruction is the strongest lever by far. It cuts both real drops and marker-only strips. The remaining catch is still valuable, but it should be treated as a backstop for risky rewrites, not the thing you rely on first.
Model choice changes the rate, not the failure mode. Haiku produced 234 content drops and 1025 marker strips across the public runs; Sonnet produced 75 drops and 499 strips. Both follow the same pattern: routine edits are quiet, full regeneration is where the risk lives, and the instruction cuts the damage.
Seed density decides whether the catch is a guard¶
The Rust Book was also run as a seed-density control. The documents stayed fixed; only marker density changed.
| seed density | cells scored | stays scored | content drops | marker strips | blocked | guard | nag | guard/nag split |
|---|---|---|---|---|---|---|---|---|
| dense, all-block | 48 | 4164 | 143 | 1236 | 43 | 22 | 21 | 51% / 49% |
| sparse, headings-only | 47 | 345 | 0 | 36 | 5 | 0 | 5 | 0% / 100% |
This is the most important refinement. Headings-only seeding is quiet and useful when section anchors are enough, but it mostly protects header identity. If a paragraph under the heading disappears, there is no paragraph stay to drop. The linter can only warn about hash drift on the surrounding section.
Dense block-level seeding creates the signal that turns the linter into a real guard: when a paragraph block disappears, its own id disappears too. The tradeoff is more source noise and more friction under full regeneration.
Private-corpus corroboration¶
A separate run on a private documentation repository of living tracking docs used the same harness shape, but those source documents are not public. The reported metrics corroborate the public-corpus shape at headings-only granularity:
- routine appends and status updates were mostly safe;
- aggressive full-document consolidation caused the real section drops;
- the preservation instruction eliminated real section drops in that corpus and cut marker stripping by about 96%;
- at headings-only density, the catch was mostly a nag, because many blocked commits were marker strips on content that survived.
The private run is useful because it uses actively maintained tracking docs. The public
FastAPI and Rust Book runs are the evidence to cite and inspect: the corpus, harness,
and every per-cell result ship in
tools/eval/dogfood/.
The boundary: rows and bullets¶
markstay v1 protects blocks. A Markdown table is one block. A tight list is one block. That means a dropped table row or tight-list bullet can leave the block's stay in place and only produce non-blocking hash drift.
A companion within-collection run measured this directly on row- and bullet-heavy tracking docs: 922 of 10,608 rows and bullets, 9%, were dropped, roughly 30x the private section-drop rate. The catch never blocked because of a row or bullet drop.
There is an opt-in count-diff catch for that loss (--check-collections): it blocks
when a kept stay's table or list ends up with fewer rows/bullets than it started with.
Measured precision is the reason it is opt-in, not a default: it is accurate where items
genuinely churn (≈75% on status refreshes, ≈94% on consolidations) but near-pure noise
on append/prose edits (≈7%), and it gets noisier on the more capable model (84% on
Haiku, 42% on Sonnet). A raw count cannot tell a benign reflow, two bullets merged into
one, a row reworded, from a real deletion, and fluent models reflow more. It also cannot
see a 1-for-1 row swap. So it is a targeted backstop for churn-heavy trackers, not a
general guard, and true per-row identity still needs the structural fix below.
The practical fix for bullets is structural: make each durable item its own block.
LOOSE list, drop one item: DROPPED_ID (blocking, caught)
TIGHT list, drop one item: HASH_DRIFT (non-blocking, silent)
Tables do not have the same escape hatch under the current spec. A Markdown table row is not its own block, so per-row identity needs either restructuring into per-row blocks or a future intra-block addressing extension.
Conclusions¶
- Install the preservation instruction. It is the active mechanism that keeps block ids attached during LLM edits.
- Keep the linter catch, but understand what it is catching. Dense block-level seeding can make it a real content-loss guard. Sparse headings-only seeding makes it a low-noise section identity layer and a nag-prone backstop.
- Seed for the durable unit you care about. If the unit is a paragraph or loose list item, give that block a stay. If the unit is a table row, markstay does not protect it yet.
Limitations¶
- The simulation forces full-document regeneration. A real agent using surgical edits should strip fewer unrelated markers.
- The content-drop / marker-strip split is judge-based. The structural linter findings are exact; the semantic "did this section survive?" call is an LLM judgment.
- The public-corpus run used two Claude models. The earlier marker-survival study covers more vendor families and shows the same broad instruction effect.
Reproduce it¶
The corpus and the per-cell results ship in
tools/eval/dogfood/,
so the numbers above are inspectable with no clone, no network, and no API key, the
same way the marker-survival eval ships its results.json. A key is only needed to
re-run the models:
pip install markstay # the published stamper (or: npm install -g markstay)
python prepare_public_corpus.py # rebuild corpus/ from the pinned upstream commits
export ANTHROPIC_API_KEY=... # only needed to re-run the models
python run_dogfood.py --docs-dir corpus/fastapi --sample 12 \
--models sonnet,haiku --arms naive,instructed --preserve-file PRESERVE.md
python run_dogfood.py --docs-dir corpus/rust-book --sample 12 \
--models sonnet,haiku --arms naive,instructed --preserve-file PRESERVE.md
The corpus is two pinned, stamped samples (FastAPI docs and the Rust Book; commits and
attribution in corpus/manifest.json and corpus/NOTICE). Stamping mints fresh ids
each run, so a regenerated corpus is structurally identical to the committed copy and
the results reproduce within model noise.