Deterministic DNA Interpretation Whitepaper (Mouse/mm39) — v1.2

DOI Whitepapers

Keywords: DNA interpretation; genome windowing; mouse mm39 (GRCm39); GC content; CpG density; repeat masking; reproducibility; audit trail; SHA-256; Zenodo.

1 Front matter

1.1 Purpose and claim boundary

This whitepaper is written to support a precise, testable, and reproducible claim:

Claim 1 (Operational “DNA 100% interpretation” claim). Within the LOCK–Derive–Gate DNA Interpreter DNA contract, DNA is 100% interpretable in the following operational sense: for any genomic region of interest (sequence plus minimal annotations), the pipeline deterministically produces (i) an A4 arrangement (stiffness shells, anchors, motors, loops) and (ii) a gate-validated irreversible event log (J_LEDGER) generated under a fixed 4-activity grammar (INIT / SCONSERV / SDISSIP / JEVENT).

This claim is deliberately not a claim about fully solving biological function. Rather, it asserts that the interpretation output objects are always produced (or a failure is explicitly labeled), and that the entire process is auditable and reproducible under a LOCK\(\rightarrow\)Derive\(\rightarrow\)Gate regime.

1.2 Scope: what is included and what is not

1.2.0.1 Included.

A deterministic mapping from 1D sequence(+annotation) to A4 primitives.
A deterministic, seed-controlled mapping from A4 to an irreversible J_LEDGER with admissibility gates.
A mouse mm39 whole-genome autoscan that selects a representative region panel (N\(\approx\)30–100) by balancing four structural axes (gene density, repeat proxy, GC fraction, CpG density), with coverage summarized in Table 3 and Figure 1.
Reproducibility artifacts: DATA_LOCK, MANIFEST (sha256), INDEX, and validation/DOI audit scripts.

1.2.0.2 Not included.

A claim that the model predicts biological phenotype or all regulatory mechanisms.
A claim that the stiffness proxy is a universally correct physical observable for DNA in vivo.
A claim that the chosen minimal annotations are complete (they are sufficient for MOTOR extraction, not for full biology).

1.3 Document conventions and terminology

1.3.0.1 Interpretation vs. meaning.

Throughout this document, “interpretation” means producing A4 and J_LEDGER objects from inputs under versioned rules. It does not mean recovering the “true meaning” of DNA in an absolute biological sense.

1.3.0.2 LOCK vs. STATE.

We separate LOCK (frozen assumptions and rules) from STATE (allowed run-time values). A change to LOCK is treated as a new version; post-hoc tuning of LOCK is disallowed in a validated run.

1.3.0.3 Determinism.

Unless explicitly noted, procedures are deterministic functions of:

input region sequence (FASTA),
annotation subset used for MOTOR extraction (GTF),
parameter dictionary (including KEY/PCTS versions),
random seed (only where explicitly used).

1.4 Reproducibility contract (LOCK–Derive–Gate)

We adopt a strict reproducibility contract with three layers:

LOCK: Versioned priors, schemas, and algorithms (e.g., the 4-activity grammar; A5 priors; KEY and PCTS versions).
Derive: All output files are derived from LOCK + inputs without hidden manual steps.
Gate: Every run produces fail-fast gates that check admissibility (e.g., \(\Delta E<0\) for JEVENT) and audit completeness.

A run is considered validated when:

inputs are recorded with sha256 hashes (DATA_LOCK),
outputs are recorded with sha256 hashes (MANIFEST),
gates return PASS (or explicit FAIL/INCONCLUSIVE with reasons),
the DOI registry has no missing DOI entries.

1.5 How to cite

Zenodo DOI (this whitepaper + reproducibility bundle):

External resources are cited via DOI-bearing references (see bibliography). Internal derived artifacts are cited by run_id and sha256 hashes recorded in DATA_LOCK and MANIFEST. When distributing results, include the validated bundle and its registry/audit scripts to allow third-party verification.

1.6 Evidence lock (fixed artifacts for this whitepaper)

Zenodo upload (recommended): upload this whitepaper PDF together with the wrapper bundle, which contains the locked evidence bundles listed below.

To make the evidence supporting Table 3 and Fig. 1 immutable and auditable, we freeze the exact evidence bundles and key files by sha256 below. Any change to these hashes (inputs, selection, or generated summaries) requires a new evidence version and a new whitepaper revision.

EVIDENCE_LOCK_v0_1

(1) mm39 archetype coverage bundle (Table 1 + Fig 1 sources)
  mm39_archetype_coverage_TABLE1_FIG1_bundle_v0_1.zip
    sha256: d9adeb446efe6ae2e1e21ef0202ea0ed92d042f5bcaaae573f67a879f15270a8
  key files inside the bundle:
    TABLE1_mm39_archetype_coverage_v0_1.csv
      sha256: c61ec72d129a899f9787e5cead3047e54004d223c3ae9bd02186c5ecc4239d77
    mm39_selected_regions_archetypeN80_v0_1.csv
      sha256: 9519b2a13164194b50043135ce021d7054c93b90c249caa7462b2ecca2c021a3
    mm39_archetype_coverage_matrix9x9_v0_1.csv
      sha256: 4881c61704f018d866e786036c383c708d20c224fd8b2a185d7c1540ff202abf
    RUN_METADATA_mm39_archetype_coverage_v0_1.json
      sha256: 309c920feb61c277ee62e67c47ec0744328a873a950f30b5950754f14ead01f7
    MANIFEST_mm39_archetype_coverage_v0_1.json
      sha256: fe7adc4fb201d31cb243c6f1e765c2248d230e59eaf1989a66be2a7475cf7f99
    mm39_inputs_sha256_v0_1.txt
      sha256: f8ea0caa5a009da6df7b3a77585f528999f56efedced5be6b48835db5b42dce5

(2) End-to-end autoscan+STEP18 bundlekit (scripts and LOCK files)
  mouse_mm39_autoscan_END2END_bundlekit_v1_0.zip
    sha256: e2e2803daa972e276b9710db39f450055ad9fdfd44e0cfab974c3a592b826ac2
  key files inside the bundlekit:
    MANIFEST.json
      sha256: 024bf5ed17fb984fa17ca9bef586c0b72883488352ba3d142422457126604e50
    CITATION_REGISTRY.yaml
      sha256: 69ca4f35e13be7c7a24bdf04d7244d1798d38ac35835a2d5d399b17a8abf8db7
    STANDARD/A5_priors_dna_v0_1.yaml
      sha256: 0fc1b857956b2aa5c277fb919ac56c935fc524707853e83210ab2637477f8f8d
    STANDARD/J_LEDGER_schema_dna_v0_1.yaml
      sha256: 1cc118ce979eecdb7b24bd93efbf2efc572941ab1057e6ae0f77f5a9f4a4b769

1.6.0.1 Evidence lock in table form (summary).

For readability, Table 1 summarizes the two primary evidence bundles and their sha256 hashes. Hashes are grouped in blocks of 8 characters in the table; the canonical unbroken hashes remain in the verbatim block above.

Evidence lock summary (primary bundles).
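Artifact (file)	sha256 (grouped)
`mm39_archetype_coverage_TABLE1_FIG1_bundle_v0_1.zip`	`d9adeb44 6efe6ae2 e1e21ef0 202ea0ed 92d042f5 bcaaae57 3f67a879 f15270a8`
`mouse_mm39_autoscan_END2END_bundlekit_v1_0.zip`	`e2e2803d aa972e27 6b9710db 39f45005 5ad9fdfd 44e0cfab 974c3a59 2b826ac2`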
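The grouping convention can be checked mechanically. The following is a minimal sketch; the helper names `group_hash` and `ungroup_hash` are illustrative and are not part of the bundle scripts.

```python
def group_hash(hex_digest: str, block: int = 8) -> str:
    """Split a hex digest into space-separated blocks of `block` characters."""
    return " ".join(hex_digest[i:i + block] for i in range(0, len(hex_digest), block))


def ungroup_hash(grouped: str) -> str:
    """Remove all whitespace to recover the canonical unbroken digest."""
    return "".join(grouped.split())


# Canonical hash of the coverage bundle, copied from EVIDENCE_LOCK_v0_1.
canonical = "d9adeb446efe6ae2e1e21ef0202ea0ed92d042f5bcaaae573f67a879f15270a8"
grouped = group_hash(canonical)
```

Comparing `ungroup_hash` of a table cell against the verbatim block is enough to confirm that a grouped entry and its canonical hash agree.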

1.6.0.2 Evidence lock in table form (key files).

Table 2 lists key files inside the bundles whose hashes are explicitly locked for this manuscript.

Evidence lock: key files (sha256).
Path (inside bundle)	sha256 (grouped)
`TABLE1_mm39_archetype_coverage_v0_1.csv`	`c61ec72d 129a899f 9787e5ce ad3047e5 4004d223 c3ae9bd0 2186c5ec c4239d77`
`mm39_selected_regions_archetypeN80_v0_1.csv`	`9519b2a1 3164194b 50043135 ce021d70 54c93b90 c249caa7 462b2ecc a2c021a3`
`mm39_archetype_coverage_matrix9x9_v0_1.csv`	`4881c617 04f018d8 66e78603 6c383c70 8d20c224 fd8b2a18 5d7c1540 ff202abf`
`RUN_METADATA_mm39_archetype_coverage_v0_1.json`	`309c920f eb61c277 ee62e67c 47ec0744 328a873a 950f30b5 950754f1 4ead01f7`
`MANIFEST_mm39_archetype_coverage_v0_1.json`	`fe7adc4f b201d31c b243c6f1 e765c224 8d230e59 eaf1989a 66be2a74 75cf7f99`
`mm39_inputs_sha256_v0_1.txt`	`f8ea0caa 5a009da6 df7b3a77 585f5289 99f56efe dced5be6 b48835db 5b42dce5`
`MANIFEST.json (bundlekit)`	`024bf5ed 17fb984f a17ca9be f586c0b7 28834883 52ba3d14 24224571 26604e50`
`CITATION_REGISTRY.yaml (bundlekit)`	`69ca4f35 e13be7c7 a24bdf04 d7244d17 98d38ac3 5835a2d5 d399b17a 8abf8db7`
`STANDARD/A5_priors_dna_v0_1.yaml`	`0fc1b857 956b2aa5 c277fb91 9ac56c93 5fc52470 7853e832 10ab2637 477f8f8d`
`STANDARD/J_LEDGER_schema_dna_v0_1.yaml`	`1cc118ce 979eecdb 7b24bd93 efbf2efc 572941ab 1057e6ae 0f77f5a9 f4a4b769`

2 Executive summary

2.1 One-page pipeline overview

The LOCK–Derive–Gate DNA Interpreter DNA pipeline is designed to make interpretation reproducible and auditable:

Input: a genomic region sequence (FASTA) and optional transcript annotations (GTF) restricted to the region.
KEY step (sequence \(\rightarrow\) structure): compute windowed sequence statistics, form a stiffness proxy, segment into shells, and define anchors; extract motors from TSS and connect to nearby anchors to form loops. The result is a minimal A4 arrangement.
PCTS step (structure \(\rightarrow\) ledger): given A4, generate a time-indexed ledger of irreversible events (J_LEDGER) under a fixed 4-activity grammar.
Audit/validation: produce DATA_LOCK, MANIFEST, INDEX, and gate tables; validate that the run is admissible and reproducible.

2.2 Core claim restated as an engineering contract

Claim 2 (Interpretation contract). For any input region, the pipeline either produces a valid A4 and a gate-admissible J_LEDGER, or it returns an explicit failure label with sufficient diagnostics to reproduce the failure. No post-hoc LOCK tuning is permitted in a validated run.

This is the sense in which we use “DNA is 100% interpretable” in this whitepaper.

2.3 Key design decisions

2.3.0.1 Activity grammar is fixed.

We do not introduce new activity categories beyond: INIT / SCONSERV / SDISSIP / JEVENT. All diversity is expressed by A4 arrangement structure.

2.3.0.2 Structure comes from shells/anchors, not from adding activities.

The KEY step is responsible for producing diverse arrangements via stiffness-shell segmentation and anchor strengths.

2.3.0.3 Irreversibility is logged, not assumed.

JEVENT entries must satisfy dissipative constraints (\(\Delta E < 0\)) and non-increasing residual stress across events.

2.4 Evidence at scale: mm39 autoscan and representative panels

To avoid relying on a small hand-picked region set, we run a whole-genome autoscan over mouse mm39. Each window is summarized by four structural observables:

gene density (TSS per Mb),
repeat proxy (softmask fraction),
GC fraction,
CpG density.

Each axis is discretized into L/M/H (terciles), producing a \(3^4=81\) archetype grid. We then select a representative panel (N\(\approx\)30–100) with the goal of maximizing archetype coverage and maintaining approximate uniformity across covered archetypes.
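As an illustration of the discretization, the sketch below bins one value per axis into L/M/H and maps the resulting 4-tuple to a cell of a 9\(\times\)9 grid. The tie convention (values equal to a cutoff fall in the lower bin) and the flattening rule (rows indexed by gene density and repeat proxy, columns by GC fraction and CpG density) are assumptions made for this example; the source does not specify either.

```python
from itertools import product


def tercile_bin(x: float, q1: float, q2: float) -> int:
    """Return 0 (L), 1 (M) or 2 (H) for a value given cutoffs q1 <= q2."""
    if x <= q1:
        return 0
    if x <= q2:
        return 1
    return 2


def archetype_cell(bins):
    """Map (gene, repeat, gc, cpg) bins to a (row, col) cell of a 9x9 grid.

    Assumed layout: row = 3*gene + repeat, col = 3*gc + cpg.
    """
    gene, repeat, gc, cpg = bins
    return (3 * gene + repeat, 3 * gc + cpg)


# Every one of the 3^4 = 81 archetypes lands in a distinct grid cell.
all_cells = {archetype_cell(b) for b in product(range(3), repeat=4)}
```

Any bijection between the 81 archetypes and the 81 grid cells would serve equally well, provided the rule is recorded alongside the figure.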

2.4.0.1 What Table 3 and Figure 1 establish.

Table 3 records bin cutoffs \((q_1,q_2)\) for each axis and the marginal balance of selected regions across L/M/H.
Figure 1 records archetype occupancy in a flattened 9\(\times\)9 grid, showing how uniformly the N-budget covers the observed archetype space.

2.5 Reproducibility statement (minimal)

The evidence is packaged as validated bundles:

each run has DATA_LOCK (input hashes and parameters),
each bundle has MANIFEST (sha256 for outputs),
gate tables and verdicts are included,
DOI audit is included for citations.

A third party can reproduce the same run_id outputs by re-running the pipeline with identical inputs and parameters.

2.6 Reading guide

Sections 3–4 define the formal objects and the reproducibility contract. Sections 5–6 define the deterministic transforms (KEY and PCTS). Sections 7–9 provide mouse mm39 data provenance and autoscan panel results (Table 3 and Fig. 1). Sections 10–11 interpret the emergence of stiffness-shell structure and document failure modes.

3 Core definitions

This section fixes the objects and terminology used throughout the paper.

3.1 Input region

Definition 1 (Region object). A region is a tuple \[\mathcal{R} := (S, \mathcal{A}, \theta),\] where \(S\) is a DNA sequence string of length \(L\) over \(\{A,C,G,T,N\}\), \(\mathcal{A}\) is an optional annotation set restricted to the region (e.g., transcripts), and \(\theta\) is a parameter dictionary (window size, step size, algorithm versions, and seed).

Remark 1 (Minimal annotation). Annotations are optional. When \(\mathcal{A}\) is absent, the pipeline still produces shells/anchors; motors/loops may be empty. This is admissible, but it changes what can be claimed (e.g., no TSS-driven MOTOR structure).

3.2 A4 arrangement

Definition 2 (A4 layout). An A4 layout is a structured object \[\mathrm{A4} := (\text{shells}, \text{anchors}, \text{motors}, \text{loops}, \text{meta}),\] where meta records version and provenance (input headers, region length, etc.).

Shells

Definition 3 (Stiffness shells). A shell is a contiguous segment \([s,e)\) of the region produced by segmenting a stiffness proxy signal derived from windowed sequence statistics. Shells are labeled by a coarse class (e.g., terciles) and carry a mean normalized value used for boundary strength.

Anchors

Definition 4 (Anchor). An anchor is a tuple \((\mathrm{id}, p, k, \alpha)\) where \(p\) is a genomic position in bp, \(k\) is an anchor kind (region edge, shell boundary, optional shell core), and \(\alpha\ge 0\) is an anchor strength (typically a boundary discontinuity magnitude).

Motors

Definition 5 (Motor). A motor is a transcript-driven position (TSS) extracted from annotations, represented as \((\mathrm{id}, p, \mathrm{strand})\) in local coordinates.

Loops

Definition 6 (Loop). A loop is a coupling between a motor and an anchor. In the minimal representation we record:

loop_id: unique identifier,
motor_id: the MOTOR endpoint (TSS position),
anchor_id: the ANCHOR endpoint (boundary/core/edge),
\(d\) (bp): the motor–anchor distance in local coordinates.

Given a fixed integer \(k\ge 1\), loop sets are generated by connecting each motor to its nearest \(k\) anchors under a deterministic rule.

3.3 Fixed activity grammar (4 operations)

Definition 7 (Activity grammar). The activity grammar is the ordered set of allowed activity categories: \[\{\mathrm{INIT}, \mathrm{SCONSERV}, \mathrm{SDISSIP}, \mathrm{JEVENT}\}.\] No additional activity categories are introduced. Any model diversity must arise from A4 structure and event logs, not from adding activity types.

3.3.0.1 Interpretation.

INIT initializes a run state. SCONSERV is a conservative projection step (constraint satisfaction). SDISSIP is a dissipative relaxation step. JEVENT is an irreversible discrete update that advances time and writes to the ledger.

3.4 J_LEDGER and gates

Definition 8 (J_LEDGER entry). A J_LEDGER entry is a row with required fields including run_id, event_id, event_type, time step, involved entity identifiers (anchors/motors/loops), trigger summary, dissipated energy \(\Delta E\), and pre/post residual stress proxies.

Definition 9 (Gate outcomes). Gate evaluation returns one of: PASS, FAIL, or INCONCLUSIVE. FAIL indicates a violated admissibility constraint (e.g., \(\Delta E\ge 0\) for JEVENT). INCONCLUSIVE indicates that a check could not be resolved under available data but is explicitly labeled.

3.5 Operational meaning of “100% interpretable”

Definition 10 (Operational interpretability). A region \(\mathcal{R}\) is interpretable under LOCK–Derive–Gate DNA Interpreter DNA if the pipeline deterministically produces:

an A4 layout (possibly with empty motors/loops when annotations are absent), and
a J_LEDGER and verdict (PASS/FAIL/INCONCLUSIVE) with complete audit artifacts (DATA_LOCK and MANIFEST).

Proposition 1 (Deterministic mapping under fixed LOCK). Fix LOCK (versions, schemas, priors), fix inputs \((S,\mathcal{A})\), fix parameters and seed. Then the mapping \[(S,\mathcal{A},\theta) \xrightarrow{\ \mathrm{KEY}\ } \mathrm{A4} \xrightarrow{\ \mathrm{PCTS}\ } (\textbf{J\_LEDGER},\mathrm{verdict})\] is deterministic, and the run_id is stable (hash of inputs+params+versions+seed).

Remark 2 (What can change). If STATE parameters are changed within allowed ranges, the run_id changes because parameters are included in the hash. If LOCK is changed (e.g., new segmentation rule), the version must be bumped and comparisons must be explicit.

3.6 Why archetype panels matter

The “100% interpretable” claim is global (any region can be processed), but evidence must be compact. Representative archetype panels provide a finite, auditable set of regions that cover the combinations of structural axes that drive A4 diversity. The autoscan selection algorithm and its coverage (Table 3 and Fig. 1) are therefore treated as core evidence, not as an optional demonstration.

4 Data contract (LOCK–Derive–Gate packaging)

This section specifies the reproducibility contract at the level of files, hashes, and validation outcomes. It is intentionally strict: a run is either validated (PASS), invalid (FAIL), or explicitly unresolved (INCONCLUSIVE) with diagnostics.

4.1 Design principles

No silent drift. Inputs and outputs are hashed (sha256) and recorded.
Stable identity. A run has a stable run_id derived from inputs and parameters.
Idempotence. Re-running the same run_id produces the same directory structure; an existing run with MANIFEST is skipped.
Fail-fast. Missing required fields or violated admissibility constraints must be detected early and recorded.

4.2 run_id: a stable identifier

Definition 11 (Run signature). A run signature is a canonical JSON object containing:

input archive hash (raw_zip_sha256),
species_id and region_id labels,
algorithm versions (KEY, PCTS),
seed,
key parameters (window sizes, steps, loop parameter \(k\), region_start_abs),
any other LOCK/STATE values that affect outputs.

Definition 12 (Stable run_id). The run_id is a truncated sha256 hash of the canonical JSON encoding of the run signature, e.g. \[\mathrm{run\_id} := \mathrm{sha256}(\mathrm{json\_canonical}(\mathrm{signature}))[:16].\]

Proposition 2 (Identity invariance). If two runs have identical run signatures, then they have the same run_id. If any field of the signature changes, the run_id changes.

Remark 3 (Why truncation is acceptable). Truncation does not change reproducibility semantics; it only shortens filenames. Collisions are possible in principle but negligible for the intended run counts; full hashes remain recorded in DATA_LOCK and MANIFEST where needed.
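A minimal sketch of Definition 12 follows. The source does not define json_canonical; here it is assumed to mean UTF-8 JSON with sorted keys and compact separators, and the signature fields in the example are illustrative values only.

```python
import hashlib
import json


def run_id(signature: dict) -> str:
    """Truncated sha256 of a canonical JSON encoding of the run signature.

    Assumed canonical form: sorted keys, no insignificant whitespace, ASCII-only.
    """
    canonical = json.dumps(
        signature, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative signature; field values are placeholders, not real run data.
signature = {
    "species_id": "mouse_mm39",
    "key_version": "KEY_v0_4",
    "pcts_version": "PCTS_v0_2",
    "seed": 1,
    "k": 2,
}
rid = run_id(signature)
```

Sorting keys makes the identifier independent of dictionary insertion order, which is what Proposition 2 requires of identical signatures.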

4.3 DATA_LOCK: freezing inputs and parameters

Definition 13 (DATA_LOCK record). For each run directory, DATA_LOCK stores a record containing:

run_id and created_utc,
absolute paths of raw inputs (as executed),
sha256 of raw_zip,
sha256 of extracted FASTA and (if present) GTF/GFF,
a full parameter dictionary equal to the run signature,
optional environment notes.

Proposition 3 (No-input-drift guarantee). If hashes match the current input files, then the run was executed on the same inputs. If they do not match, the run is invalid for the claimed inputs (FAIL).

4.4 MANIFEST: freezing all derived outputs

Definition 14 (MANIFEST record). A MANIFEST is a JSON file listing, for every file under a run root, the relative path, byte size, and sha256 hash.

Proposition 4 (Output integrity guarantee). If the MANIFEST matches the on-disk directory (all listed files exist and sha256 hashes match), then the output artifacts are exactly those produced at run time (up to the MANIFEST generation procedure).

Remark 4 (MANIFEST self-reference). Depending on implementation, the MANIFEST may or may not include its own hash entry. This is acceptable provided the rule is consistent and documented.
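The MANIFEST check in Proposition 4 can be sketched as follows. The entry keys (`path`, `bytes`, `sha256`) and the choice to exclude a file named `MANIFEST.json` from its own listing are assumptions for this example, in line with Remark 4; the bundlekit's actual layout may differ.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root, exclude=("MANIFEST.json",)):
    """List relative path, byte size and sha256 for every file under root."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name not in exclude:
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha256_file(path),
            })
    return entries


def verify_manifest(root, entries):
    """Return PASS if every listed file exists and matches its recorded hash."""
    root = Path(root)
    for entry in entries:
        path = root / entry["path"]
        if not path.is_file() or sha256_file(path) != entry["sha256"]:
            return "FAIL"
    return "PASS"
```

Sorting the paths keeps the listing order stable across file systems, so two manifests of the same tree compare equal byte for byte once serialized with a fixed JSON convention.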

4.5 INDEX: append-only run summaries

Definition 15 (INDEX row). An INDEX row records one run summary with fields such as: run_id, created_utc, species_id, region_id, seed, key_version, pcts_version, input hashes, counts (shells/anchors/motors/loops), PCTS totals (T_end, n_events, Pi_T proxy), and pass/fail.

Proposition 5 (Append-only auditability). If the INDEX is append-only, then historical runs remain auditable even when new runs are added.

4.6 Gate outcomes and validation logic

Definition 16 (Gate outcomes). Every gate check returns exactly one of: PASS, FAIL, or INCONCLUSIVE.

4.6.0.1 Integrity gates (mandatory).

G_LOCK_PRESENT: DATA_LOCK is present and parseable.
G_MANIFEST_PRESENT: MANIFEST is present and parseable.
G_MANIFEST_MATCH: MANIFEST hashes match the file tree.
G_SCHEMA_COMPLETE: required outputs exist (A4 layout, snapshots, ledger, verdict).

4.6.0.2 Ledger admissibility gates (mandatory).

G_\(\Delta E\): each JEVENT must satisfy \(\Delta E < 0\).
G_PRES: residual stress proxy must not increase across JEVENT.
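The two ledger gates reduce to simple predicates over JEVENT rows. In this sketch the row keys `delta_E`, `pres_pre`, and `pres_post` are hypothetical names for the fields listed in Definition 8, and an empty ledger is assumed to pass vacuously.

```python
def gate_delta_e(events):
    """G_DeltaE: every JEVENT row must have strictly negative dissipated energy."""
    return "PASS" if all(e["delta_E"] < 0 for e in events) else "FAIL"


def gate_pres(events):
    """G_PRES: the residual stress proxy must not increase across any JEVENT."""
    return "PASS" if all(e["pres_post"] <= e["pres_pre"] for e in events) else "FAIL"


# Illustrative rows; values are placeholders, not results from a real run.
good = [
    {"delta_E": -1.0, "pres_pre": 120.0, "pres_post": 100.0},
    {"delta_E": -0.5, "pres_pre": 100.0, "pres_post": 100.0},
]
bad = [{"delta_E": 0.0, "pres_pre": 100.0, "pres_post": 110.0}]
```

Note that \(\Delta E = 0\) fails the first gate, matching the FAIL condition \(\Delta E\ge 0\) in Definition 9.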

4.6.0.3 Reproducibility gates (mandatory).

Under identical LOCK/A4/seed, the event sequence must be reproducible, or the result must be labeled INCONCLUSIVE with diagnostics.

4.7 Invariants implied by the contract

The contract implies the following invariants (used as checklists for both authors and auditors):

Deterministic derivation: outputs are deterministic functions of inputs + parameters + seed.
Complete provenance: every run has a DATA_LOCK; every bundle has a MANIFEST.
Explicit versioning: KEY and PCTS versions are always recorded and cannot be inferred post-hoc.
Closed-world citations: the DOI registry has no missing DOI entries (audit passes).

4.8 Failure modes (and how they appear)

4.8.0.1 Missing inputs.

No FASTA in the raw zip is an execution error (FAIL). Missing GTF is admissible but leads to an A4 without motors/loops; this must be stated (and can be marked INCONCLUSIVE for claims requiring motors).

4.8.0.2 Coordinate mismatch.

If GTF coordinates are not correctly shifted into the local window, motors may fall outside the region, yielding an empty MOTOR set. This is detectable in counts and should be flagged.

4.8.0.3 Partial runs.

If a run directory exists without MANIFEST, it is not validated. The recommended policy is to treat it as incomplete and re-run to completion.

4.8.0.4 Non-deterministic behavior.

If a component depends on unrecorded state (e.g., system time affecting ledger content), reproducibility checks must ignore non-semantic fields or record them explicitly. When reproducibility cannot be established, label INCONCLUSIVE.

4.8.0.5 Post-hoc LOCK tuning.

Any change to LOCKed priors/algorithms after a run invalidates the run for that LOCK version. The correct response is a version bump, not silent modification.

5 KEY: sequence(+annotation) to A4 arrangement

The KEY step defines the deterministic mapping from a region’s 1D sequence (and minimal annotations) into an A4 arrangement. KEY is the only place where structural diversity is introduced; the activity grammar remains fixed.

5.1 Windowed observables and stiffness proxy

Let the region sequence be \(S\) of length \(L\) (bp). Choose a window length \(W\) and step \(\Delta\) (bp). For each window \(S_{[i,i+W)}\), compute:

GC fraction: \[\mathrm{GC}(S)=\frac{\#\{G,C\}}{\#\{A,C,G,T\}}.\]
CpG density (over adjacent pairs): \[\mathrm{CpG}(S)=\frac{\#\{j: S_{j}S_{j+1}=\texttt{CG}\}}{\max(1,|S|-1)}.\]
AT-tract density (proxy): count occurrences of AAAAAA or TTTTTT (overlapping) normalized by window length.
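The three observables can be sketched directly from the formulas above. Function names are illustrative; returning 0.0 for a window with no valid bases, dropping a trailing partial window, and guarding the AT-tract normalization with max(1, ·) are assumptions not stated in the source.

```python
def gc_fraction(s: str) -> float:
    """#{G,C} / #{A,C,G,T}; N and other symbols are excluded from the denominator."""
    s = s.upper()
    valid = sum(s.count(base) for base in "ACGT")
    return (s.count("G") + s.count("C")) / valid if valid else 0.0


def cpg_density(s: str) -> float:
    """Fraction of adjacent pairs equal to CG, over max(1, |S| - 1) pairs."""
    s = s.upper()
    hits = sum(1 for j in range(len(s) - 1) if s[j:j + 2] == "CG")
    return hits / max(1, len(s) - 1)


def at6_density(s: str) -> float:
    """Overlapping occurrences of AAAAAA or TTTTTT, normalized by window length."""
    s = s.upper()
    hits = sum(1 for j in range(len(s) - 5) if s[j:j + 6] in ("AAAAAA", "TTTTTT"))
    return hits / max(1, len(s))


def windows(seq: str, W: int, step: int):
    """Yield (start, subsequence) for each full window S[i, i+W)."""
    for i in range(0, len(seq) - W + 1, step):
        yield i, seq[i:i + W]
```

Counting overlapping matches means a run of seven A bases contributes two AT-tract hits, not one.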

Definition 17 (Stiffness proxy). A stiffness proxy is a linear combination of windowed observables: \[\mathrm{stiff\_raw} := w_{\mathrm{GC}}\cdot \mathrm{GC} + w_{\mathrm{CpG}}\cdot \mathrm{CpG} + w_{\mathrm{AT}}\cdot \mathrm{AT6},\] with fixed weights \((w_{\mathrm{GC}},w_{\mathrm{CpG}},w_{\mathrm{AT}})\).

Remark 5 (Proxy interpretation). The stiffness proxy is a deterministic structural signal used to produce shells and anchors. We do not claim it is a complete physical model of chromatin mechanics; it is a minimal, auditable mapping from sequence statistics to a segmentation.

5.2 Robust normalization and smoothing

Definition 18 (Robust z-score via MAD). Given a list of stiffness values \(\{x_i\}\), define the median \(m\) and MAD \(d=\mathrm{median}(|x_i-m|)\). Then the robust z-score is \[z_i := \frac{x_i-m}{1.4826\,\max(d,\epsilon)},\] with a small \(\epsilon\) to avoid division by zero.

A short-radius moving average smoothing is applied to \(z_i\) to reduce single-window noise without introducing nonlocal dependencies.
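Definition 18 and the smoothing step can be sketched as below. The default \(\epsilon\), the smoothing radius, and the truncated averaging at the ends of the list are assumptions; the source fixes only the formula for \(z_i\).

```python
import statistics


def robust_z(xs, eps=1e-9):
    """Robust z-score: (x - median) / (1.4826 * max(MAD, eps))."""
    m = statistics.median(xs)
    mad = statistics.median([abs(x - m) for x in xs])
    scale = 1.4826 * max(mad, eps)
    return [(x - m) / scale for x in xs]


def moving_average(zs, radius=1):
    """Centered moving average; windows are truncated at the list boundaries."""
    out = []
    for i in range(len(zs)):
        lo = max(0, i - radius)
        hi = min(len(zs), i + radius + 1)
        out.append(sum(zs[lo:hi]) / (hi - lo))
    return out
```

The constant 1.4826 rescales the MAD so that, for normally distributed values, the robust z-score is comparable to an ordinary z-score.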

5.3 Shell segmentation

Definition 19 (Tercile labeling). Let \(q_1,q_2\) be the 33rd and 66th percentiles of the smoothed \(z_i\) values. Assign a label \(\ell_i\in\{0,1,2\}\) by: \(\ell_i=0\) if \(z_i\le q_1\), \(\ell_i=1\) if \(q_1<z_i\le q_2\), and \(\ell_i=2\) if \(z_i>q_2\).

Definition 20 (Shell). A shell is a maximal contiguous run of windows with identical label \(\ell\). Each shell stores its coordinate interval \([s,e)\) and its mean \(\bar z\).
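Definitions 19 and 20 amount to thresholding followed by run-length grouping. The sketch below takes the cutoffs \(q_1,q_2\) as inputs (how the percentiles are computed is left open) and reports shells as window-index ranges; converting indices to bp coordinates is omitted and would depend on \((W,\Delta)\).

```python
from itertools import groupby


def tercile_labels(z, q1, q2):
    """Label 0 if z <= q1, 1 if q1 < z <= q2, 2 if z > q2."""
    return [0 if v <= q1 else 1 if v <= q2 else 2 for v in z]


def shells(z, labels):
    """Maximal runs of identical label as (first_idx, end_idx_exclusive, label, mean_z)."""
    out = []
    start = 0
    for label, run in groupby(labels):
        n = len(list(run))
        mean_z = sum(z[start:start + n]) / n
        out.append((start, start + n, label, mean_z))
        start += n
    return out


z = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
labels = tercile_labels(z, q1=-0.5, q2=0.5)
shell_list = shells(z, labels)
```

Because `groupby` only merges adjacent equal labels, two shells with the same class separated by a different class remain distinct, as the definition of a maximal contiguous run requires.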

5.3.0.1 Short-shell merging.

To avoid unstable segmentation artifacts, shells shorter than a minimum length (e.g., 5 kb) are merged into neighbors using a continuity heuristic (merge toward the neighbor with closer mean \(\bar z\)).

Proposition 6 (Deterministic shells). For fixed \((S,W,\Delta)\) and fixed merging rules, the shell list is deterministic.

5.4 Boundaries and anchor construction

Definition 21 (Boundary strength). For adjacent shells \(a\) and \(b\) with mean values \(\bar z_a,\bar z_b\), define boundary strength \[\sigma := |\bar z_b - \bar z_a| \ge 0,\] located at the coordinate where shell \(b\) begins.

Definition 22 (Anchor kinds). Anchors are created at:

region edges (\(0\) and \(L\)),
each shell boundary position (strength \(\sigma\)),
optional shell cores (e.g., midpoints of extremal shells) to stabilize reference points.

Proposition 7 (Anchor invariants). All anchors satisfy:

positions lie in \([0,L]\),
strengths are nonnegative,
anchor identifiers are unique within a run.
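A minimal sketch of anchor construction from a shell list follows. Shells are given as (start_bp, end_bp, mean_z) tuples; the identifier scheme and the strength of 0.0 assigned to region-edge anchors are assumptions, and optional shell-core anchors are omitted.

```python
def build_anchors(shell_list, L):
    """Return anchors (id, position_bp, kind, strength) for edges and shell boundaries."""
    anchors = [("A0", 0, "edge", 0.0)]
    for prev, nxt in zip(shell_list, shell_list[1:]):
        sigma = abs(nxt[2] - prev[2])  # boundary strength, located where nxt begins
        anchors.append((f"A{len(anchors)}", nxt[0], "boundary", sigma))
    anchors.append((f"A{len(anchors)}", L, "edge", 0.0))
    return anchors


# Illustrative shells over a 400 bp region; values are placeholders.
example_shells = [(0, 100, -1.0), (100, 250, 0.5), (250, 400, 0.0)]
anchors = build_anchors(example_shells, L=400)
```

The three invariants of Proposition 7 can then be asserted directly on the returned list as a fail-fast check.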

5.5 Motor extraction from annotations

When annotations are provided, we extract motors as transcript start sites.

Definition 23 (TSS extraction rule). For a transcript with genomic coordinates \((\mathrm{start},\mathrm{end})\) and strand \(\in\{+,-\}\): \[\mathrm{TSS}_{\mathrm{abs}} := \begin{cases} \mathrm{start} & \text{if strand }=+\\ \mathrm{end} & \text{if strand }=-. \end{cases}\] Local coordinates are obtained by subtracting a region start offset: \(\mathrm{TSS}_{\mathrm{local}}=\mathrm{TSS}_{\mathrm{abs}}-\mathrm{region\_start\_abs}\).

Only motors with \(0\le \mathrm{TSS}_{\mathrm{local}}\le L\) are retained.
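The TSS rule and the retention filter can be sketched as a single function. The name `tss_local` and the use of `None` for a rejected motor are illustrative; inputs are assumed to be in the same coordinate convention as region_start_abs.

```python
def tss_local(start, end, strand, region_start_abs, L):
    """Return the local TSS coordinate, or None if it falls outside [0, L]."""
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    tss_abs = start if strand == "+" else end
    local = tss_abs - region_start_abs
    return local if 0 <= local <= L else None
```

For a region starting at absolute position 1000 with \(L=2000\), a plus-strand transcript spanning (1500, 2500) yields a local TSS of 500, while the same interval on the minus strand yields 1500.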

Remark 6 (Annotation provenance). In this whitepaper, mouse mm39 transcript annotations are sourced from RefSeq/refGene via UCSC, and the assembly provenance is documented separately.

5.6 Loop generation

Definition 24 (\(k\)-nearest anchor rule). For each motor position \(p\), define its candidate anchor set by selecting the \(k\) anchors minimizing \(|p-a|\). Each selected pair defines a loop with distance \(d=|p-a|\).

Proposition 8 (Loop invariants). For every loop:

motor_id and anchor_id exist in the corresponding tables,
distance \(d\) is nonnegative and consistent with stored positions,
loop identifiers are unique.
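The \(k\)-nearest anchor rule needs a tie-break to be deterministic. The sketch below breaks distance ties by anchor position and then by anchor identifier; this ordering and the loop identifier scheme are assumptions, since the source requires only that the rule be deterministic.

```python
def knn_loops(motors, anchors, k):
    """Connect each motor (id, pos) to its k nearest anchors (id, pos).

    Returns loops as (loop_id, motor_id, anchor_id, distance_bp).
    """
    loops = []
    for motor_id, p in motors:
        ranked = sorted(anchors, key=lambda a: (abs(p - a[1]), a[1], a[0]))
        for anchor_id, anchor_pos in ranked[:k]:
            loops.append((f"L{len(loops)}", motor_id, anchor_id, abs(p - anchor_pos)))
    return loops


# Illustrative inputs; positions are placeholders in local bp coordinates.
motors = [("M0", 120)]
anchors = [("A0", 0), ("A1", 100), ("A2", 250), ("A3", 400)]
loops = knn_loops(motors, anchors, k=2)
```

If fewer than \(k\) anchors exist, the slice simply returns all of them, so a motor never produces more loops than there are anchors.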

5.7 KEY versions

KEY_v0_3_1 (single-scale)

KEY_v0_3_1 uses a single window scale \((W,\Delta)\) to produce one shell hierarchy.

KEY_v0_4 (multiscale)

KEY_v0_4 uses two scales:

fine tier: \((W_f,\Delta_f)\),
coarse tier: \((W_c,\Delta_c)\).

Anchors derived from both tiers are merged by positional proximity within a tolerance \(\tau\) bp, retaining the stronger anchor when duplicates occur.

Proposition 9 (Multiscale anchor merge determinism). For fixed fine/coarse outputs and fixed merge tolerance \(\tau\), the merged anchor list is deterministic.
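One deterministic way to realize the merge is a single sweep over anchors sorted by position. This is a sketch of one possible rule, not the KEY_v0_4 implementation: anchors are reduced to (position, strength) pairs, and an anchor within \(\tau\) bp of the currently retained one is treated as a duplicate.

```python
def merge_anchors(fine, coarse, tau):
    """Merge two anchor lists of (position_bp, strength), keeping the stronger duplicate."""
    merged = []
    for pos, strength in sorted(fine + coarse):
        if merged and pos - merged[-1][0] <= tau:
            # Duplicate within tolerance: retain whichever anchor is stronger.
            if strength > merged[-1][1]:
                merged[-1] = (pos, strength)
        else:
            merged.append((pos, strength))
    return merged


# Illustrative tiers; values are placeholders.
fine = [(100, 0.5), (300, 1.0)]
coarse = [(104, 0.8), (600, 0.2)]
merged = merge_anchors(fine, coarse, tau=10)
```

Sorting the concatenated list first makes the result independent of which tier is passed first, which is the property Proposition 9 asks for.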

5.8 Outputs

KEY emits:

window feature TSVs,
shell and boundary TSVs,
anchor, motor, and loop TSVs,
a minimal A4 YAML summarizing the arrangement.

5.9 Failure modes and diagnostic signals

5.9.0.1 FASTA multiplicity.

If the input FASTA contains multiple sequences, a strict reader may ignore additional headers; this must be documented or pre-filtered. For validated bundles, we require a single-sequence region FASTA.

5.9.0.2 Excess ambiguous bases.

High N-content reduces the reliability of sequence statistics. Autoscan typically filters windows with insufficient valid base fraction; such filters should be recorded in parameters.

5.9.0.3 Missing or mismatched GTF.

If GTF is absent, motors/loops are empty. If GTF coordinates are not aligned to the region, motors may all fall outside. Both cases are visible in motor/loop counts and should be surfaced explicitly in INDEX and gates.

5.9.0.4 Parameter extremes.

Very small \(W\) can overfit local noise; very large \(W\) can wash out boundaries. These are not “errors” but affect shell/anchor density and should be explored only via STATE parameter sweeps, never post-hoc LOCK changes.

6 PCTS: minimal ledger generator (PCTS_v0_2)

PCTS_v0_2 maps an A4 layout to a time-indexed ledger under a fixed 4-activity grammar. It is explicitly a ledger generator, not a biological predictor.

6.1 State variables and outputs

Definition 25 (Run state). A minimal run state contains:

the set of loops and the subset marked locked,
a residual stress proxy \(\mathrm{pres}(t)\) derived from remaining unlocked loops,
deterministic pseudo-random state (seeded), if used for proxy updates.

Definition 26 (Outputs). PCTS_v0_2 emits:

snapshots.csv: step-wise summary (e.g., pres, mean distance),
J_LEDGER file: event entries for JEVENT steps,
gate table: gate results,
verdict.json: summary (PASS/FAIL, T_end, n_events, Pi_T proxy).

6.2 Residual stress proxy

Definition 27 (Residual stress proxy). Let \(\mathcal{L}_{\mathrm{unlocked}}(t)\) be the set of loops not yet locked at time \(t\). Define \[\mathrm{pres}(t) := \frac{1}{\max(1,|\mathcal{L}_{\mathrm{unlocked}}(t)|)}\sum_{\ell\in\mathcal{L}_{\mathrm{unlocked}}(t)} d(\ell),\] where \(d(\ell)\) is the motor–anchor distance of loop \(\ell\).

This proxy is intentionally simple: it is an auditable scalar that should not increase across irreversible JEVENT updates.
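Definition 27 can be written as a short function. The representation of loops as a mapping from loop identifier to distance is an assumption for this sketch.

```python
def pres(loop_distances, locked):
    """Mean motor-anchor distance over unlocked loops; 0.0 when none remain."""
    unlocked = [d for loop_id, d in loop_distances.items() if loop_id not in locked]
    return sum(unlocked) / max(1, len(unlocked))


# Illustrative distances in bp; values are placeholders.
loop_distances = {"L0": 20, "L1": 120}
initial = pres(loop_distances, locked=set())
```

Under this literal reading, locking a loop shorter than the current mean raises the mean over the remaining loops, so whether the non-increase gate passes depends on how the SDISSIP relaxation and the gate comparison are defined in the actual implementation.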

6.3 Four-activity progression

PCTS advances in discrete steps \(t=0,1,2,\dots\) with the following fixed grammar:

INIT: initialize locked set \(\emptyset\) and compute initial pres.
SCONSERV: conservative projection (placeholder in v0_2; preserves admissible state representation).
SDISSIP: dissipative relaxation of the pres proxy (deterministic given seed).
JEVENT: irreversible events that lock eligible loops and write J_LEDGER entries.

6.4 Event rule: loop stabilization

Definition 28 (Lock threshold). Each loop \(\ell\) is associated with an anchor strength \(\alpha(\ell)\ge 0\). Define a lock threshold (bp) \[d_{\mathrm{lock}}(\ell) := d_0\cdot \left(1 + c\cdot \min(2,\alpha(\ell))\right),\] where \(d_0\) is a base threshold and \(c\) is a fixed coefficient.

Definition 29 (J_LOOP_STABILIZATION event). At time step \(t\), an unlocked loop \(\ell\) is eligible for stabilization if \[d(\ell) \le d_{\mathrm{lock}}(\ell).\] When eligible, the loop is marked LOCKED and an event row is appended to J_LEDGER with: (event_type = J_LOOP_STABILIZATION, involved IDs, trigger summary, \(\Delta E < 0\), pres_pre, pres_post).
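The threshold and eligibility test can be sketched as follows. The numeric values of \(d_0\) and \(c\) used in the example are placeholders; the LOCKed values are not given in this section.

```python
def d_lock(alpha, d0, c):
    """Lock threshold in bp: d0 * (1 + c * min(2, alpha))."""
    return d0 * (1 + c * min(2.0, alpha))


def eligible(d, alpha, d0, c):
    """A loop is eligible for stabilization when d <= d_lock."""
    return d <= d_lock(alpha, d0, c)


# Placeholder parameters for illustration only.
D0, C = 1000, 0.5
```

The cap min(2, \(\alpha\)) bounds the threshold at \(d_0(1+2c)\), so a very strong anchor cannot make arbitrarily distant loops eligible.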

6.5 Admissibility gates (what must be true)

Proposition 10 (\(\Delta E\) negativity). In PCTS_v0_2, all JEVENT entries satisfy \(\Delta E<0\) by construction.

Proposition 11 (Non-increasing residual stress across steps). The snapshot pres list is required to be non-increasing: \[\mathrm{pres}(t)\le \mathrm{pres}(t-1)\quad \forall t\ge 1.\] This is checked as a mandatory gate (FAIL if violated).

Proposition 12 (Monotone locking). The number of locked loops is non-decreasing in time, and never exceeds the total loop count.

6.6 Reproducibility and what is compared

6.6.0.1 Semantic reproducibility.

We distinguish semantic reproducibility (event skeleton) from non-semantic fields (e.g., timestamps).

Definition 30 (Event skeleton). The event skeleton is the ordered list of tuples: \[(\mathrm{t\_step}, \mathrm{event\_type}, \mathrm{loop\_id}, \mathrm{motor\_id}, \mathrm{anchor\_id}).\]

Proposition 13 (Deterministic skeleton under fixed A4 and seed). For fixed A4 and fixed seed, the event skeleton is deterministic in PCTS_v0_2. If a future implementation includes additional stochasticity affecting eligibility, the skeleton must still be reproducible or labeled INCONCLUSIVE with diagnostics.
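A skeleton comparison between two runs can be sketched as below. Ledger rows are assumed to be dictionaries keyed by the field names of Definition 30; any other keys (for example a timestamp) are ignored. Returning the index of the first mismatch as a diagnostic is a choice made for this example.

```python
SKELETON_FIELDS = ("t_step", "event_type", "loop_id", "motor_id", "anchor_id")


def skeleton(ledger):
    """Project each ledger row onto the semantic fields, preserving order."""
    return [tuple(row[f] for f in SKELETON_FIELDS) for row in ledger]


def compare_runs(ledger_a, ledger_b):
    """Return ('PASS', None) if skeletons match, else ('INCONCLUSIVE', first_diff_index)."""
    sa, sb = skeleton(ledger_a), skeleton(ledger_b)
    if sa == sb:
        return ("PASS", None)
    n = min(len(sa), len(sb))
    first = next((i for i in range(n) if sa[i] != sb[i]), n)
    return ("INCONCLUSIVE", first)
```

Two ledgers that differ only in non-semantic fields therefore compare as PASS, while a difference in event order or length is reported with the position at which the runs diverge.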

6.7 Pi_T proxy and interpretation

PCTS_v0_2 records a Pi_T proxy as a compact time-normalized summary. In v0_2, this proxy is intentionally simplified to track effective run length (T_end steps); future versions may incorporate additional LOCKed priors.

6.8 Failure modes and edge cases

6.8.0.1 No loops.

If the loop set is empty (e.g., no motors), the ledger may be empty and the run may simply report no events. This is admissible for “structure-only” interpretations but should be flagged when claims require motor-driven coupling.

6.8.0.2 Max-step truncation.

If max_steps is reached before all loops are locked, the run must report this condition (e.g., n_locked \(<\) n_loops). Depending on the intended claim, this may be FAIL or INCONCLUSIVE.

6.8.0.3 Schema drift.

If a J_LEDGER schema changes between versions, the schema version must be LOCKed and recorded. A run that cannot be interpreted under its declared schema is invalid.

6.8.0.4 Non-increasing pres gate violation.

A violation indicates either a bug or an incompatible definition of pres for the intended dynamics. In either case, the gate provides a fail-fast diagnostic.

6.9 Summary: why PCTS_v0_2 is sufficient for the whitepaper

For the whitepaper claim, PCTS_v0_2 provides:

a fixed activity grammar,
deterministic mapping from A4 to an irreversible event ledger,
explicit gates to validate admissibility,
audit artifacts to support third-party verification.

This is exactly the “interpreter” role described by the LOCK\(\rightarrow\)Derive\(\rightarrow\)Gate contract.

7 Mouse/mm39 reference data provenance

This section specifies exactly what is meant by “mouse mm39 data” in this whitepaper, where it comes from, and how provenance is verified. The purpose is to prevent the most common failure mode in genome-scale work: silent assembly/annotation drift.

7.1 Assembly identity and coordinate system

Definition 31 (Assembly identity). Throughout this document, mm39 refers to the mouse reference assembly GRCm39. All genomic coordinates are interpreted in that assembly’s coordinate system.

Proposition 14 (Coordinate consistency requirement). All pipeline steps that combine sequence and annotation (MOTOR extraction) require that:

chromosome naming conventions match (e.g., chr1 vs 1),
the annotation coordinates are in the same assembly as the FASTA,
any window-level coordinate shift () is recorded and applied consistently.

If any of these conditions fail, MOTOR counts and positions become invalid for the intended region (FAIL or INCONCLUSIVE depending on claim scope).
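
As a small illustration of the first and third conditions, the helpers below normalize chromosome names and shift absolute positions into region coordinates. Both function names are hypothetical, positions are assumed to be 0-based, and special cases such as the mitochondrial chromosome (MT versus chrM) are deliberately not handled.

```python
def normalize_chrom(name):
    """Map names such as '1' or 'X' to the UCSC style 'chr1' or 'chrX'."""
    return name if name.startswith("chr") else "chr" + name


def to_local(pos_abs, region_start_abs, region_length):
    """Shift an absolute position into region coordinates; None if outside."""
    local = pos_abs - region_start_abs
    return local if 0 <= local < region_length else None
```

Returning None for out-of-region positions makes a wrong shift visible as missing motors rather than as silently misplaced ones.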

7.2 Primary sources for genome sequence and annotations

7.2.0.1 Genome sequence (FASTA).

We obtain the mouse mm39 sequence via the UCSC Genome Browser download infrastructure. For autoscan and repeat proxy construction, we use soft-masked chromosome FASTA where repeats are indicated by lowercase bases.

7.2.0.2 Gene/transcript annotation (refGene/RefSeq).

We obtain transcript models from RefSeq-derived sources as provided through UCSC (e.g., refGene/RefSeq tracks). This provides a minimal annotation sufficient to extract transcription start sites (TSS) as MOTOR positions.

7.2.0.3 Assembly provenance.

We cite NCBI Assembly for assembly identity and provenance, and GenBank as the primary archive underpinning assembled genome distribution.

7.2.0.4 Repeat masking provenance.

Soft masking and repeat-oriented handling are consistent with standard RepeatMasker-based pipelines.

7.3 Provenance verification: what we hash and why

Definition 32 (Input verification by hash). For every run, input files are verified by sha256 hashes recorded in DATA_LOCK:

for the region archive,
for the extracted region FASTA,
(or empty) for the extracted region GTF.

Proposition 15 (Reproducible provenance). If two users download the same upstream UCSC/NCBI sources and run the same autoscan/extraction scripts, they will generate the same region FASTA/GTF, hence the same hashes and run_id (given the same selection parameters and seed).

Remark 7 (Why we hash the extracted region files). Hashing the upstream multi-gigabyte genome archive is not sufficient if the extraction pipeline can differ. We hash the derived region FASTA and region GTF to freeze exactly what KEY and PCTS see.
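
Hashing a region file can be done with the Python standard library alone. The sketch below reads the file in chunks so that it also works for large inputs; the function name and the chunk size are illustrative choices, not part of the released scripts.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```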

7.4 Practical conventions used in this whitepaper

7.4.0.1 Chromosome set.

The autoscan typically targets canonical chromosomes (chr1–chr19, chrX, chrY) and can exclude unlocalized contigs unless a specific analysis requires them.

7.4.0.2 Softmask fraction as repeat proxy.

The repeat proxy axis is defined as the fraction of lowercase A/C/G/T bases in soft-masked FASTA windows. This is not a repeat family decomposition; it is a single scalar suitable for archetype balancing.

7.4.0.3 TSS de-duplication.

When multiple transcript entries share the same TSS, MOTOR entries can be de-duplicated or retained depending on policy. The policy must be recorded (LOCK or STATE) and reflected in counts.

7.5 Failure modes (and how to detect them)

7.5.0.1 Wrong assembly (mm10 vs mm39).

Symptom: many transcripts fall outside extracted regions; MOTOR counts near zero in gene-dense areas; inconsistent chromosome lengths. Mitigation: enforce assembly metadata, and hash the exact inputs.

7.5.0.2 Chromosome naming mismatch.

Symptom: annotation parser produces zero motors because chromosome IDs do not match. Mitigation: normalize names at download/extraction; record normalization in parameters.

7.5.0.3 Softmask unavailable or inconsistent.

Symptom: repeat proxy axis collapses (softmask fraction near zero for all windows). Mitigation: ensure soft-masked FASTA is used; record whether masking is present.

7.5.0.4 Annotation version drift.

Symptom: gene density distribution shifts; tercile cutoffs change substantially. Mitigation: treat the annotation release as part of LOCK; record hashes; regenerate Table 1 when inputs change.

7.5.0.5 Assembly gaps / high-N windows.

Symptom: windows contain too many ambiguous bases; GC/CpG estimates become unstable. Mitigation: apply a valid-base fraction filter (recorded in parameters) and label excluded windows.

8 mm39 autoscan to representative region panel

This section specifies the whole-genome autoscan and the deterministic selection of a representative region panel (N\(\approx\)30–100) for evidence. The purpose is to prevent cherry-picking: we want a compact but diverse set of regions that covers the structural archetypes observed in the genome.

8.1 Windowing scheme and candidate set

Let the genome be represented as a set of chromosomes \(\{C_k\}\) with sequences. Fix a window length \(W\) and step \(\Delta\). For each chromosome, we generate windows \([i,i+W)\) for \(i=0,\Delta,2\Delta,\dots\) with truncation at chromosome end if desired. The autoscan produces a candidate set \(\mathcal{W}\) of windows.

Definition 33 (Valid-base fraction filter). For a window, define \[\mathrm{valid\_frac} := \frac{\#\{A,C,G,T\}}{W}.\] Windows with \(\mathrm{valid\_frac} < v_{\min}\) are excluded from archetype statistics and selection.

Remark 8 (Why filtering is required). If a window contains too many ambiguous bases, GC/CpG and softmask fraction become unreliable. Filtering is therefore a mandatory part of a validated autoscan; \(v_{\min}\) is recorded in parameters.
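
The windowing scheme and the filter of Definition 33 can be sketched as below. The function names are illustrative. Two assumptions are made: the final window of a chromosome is truncated rather than dropped, and bases are counted case-insensitively because the input is soft-masked.

```python
def windows(chrom_length, w, step):
    """Yield half-open [start, end) windows; the last one may be truncated."""
    for start in range(0, chrom_length, step):
        yield start, min(start + w, chrom_length)


def valid_frac(seq, w):
    """Fraction of unambiguous bases, divided by W as in Definition 33."""
    n_valid = sum(seq.count(base) for base in "ACGTacgt")
    return n_valid / w
```

Because the denominator is the nominal window length \(W\), a truncated final window is penalized and is more likely to fall below \(v_{\min}\).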

8.2 Autoscan observables (the 4 structural axes)

For each candidate window, compute the following observables:

Gene density: number of TSS falling inside the window, normalized per Mb (from RefSeq/refGene).
Repeat proxy: softmask fraction (fraction of lowercase A/C/G/T).
GC fraction: \(\mathrm{GC}=\frac{\#(G+C)}{\#(A+C+G+T)}\).
CpG density: count of CG dinucleotides per kb; classical CpG island context is discussed in .
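
The four observables can be computed from a window sequence and a TSS count as sketched below. The function name and dictionary keys are hypothetical. The choice of denominators is an assumption: GC, CpG, and softmask fraction are normalized by the number of A/C/G/T bases (N excluded), and gene density by the window length.

```python
def window_observables(seq, n_tss):
    """Compute the four autoscan observables for one window sequence."""
    upper = seq.upper()
    n_acgt = sum(upper.count(base) for base in "ACGT")   # valid bases, N excluded
    if n_acgt == 0:
        raise ValueError("window has no A/C/G/T bases")
    n_lower = sum(seq.count(base) for base in "acgt")    # soft-masked bases
    return {
        "tss_per_mb": n_tss * 1e6 / len(seq),
        "softmask_frac": n_lower / n_acgt,
        "gc_frac": (upper.count("G") + upper.count("C")) / n_acgt,
        "cpg_per_kb": 1000 * upper.count("CG") / n_acgt,
    }
```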

8.3 Binning: L/M/H terciles per axis

Definition 34 (Axis terciles). Given an observable \(x\) over candidate windows, define \(q_1,q_2\) as the 33rd and 66th percentiles of \(x\). Assign a bin \(\in\{L,M,H\}\) by: \[\mathrm{bin}(x)= \begin{cases} L & x \le q_1\\ M & q_1 < x \le q_2\\ H & x > q_2. \end{cases}\]

Proposition 16 (Deterministic binning). For fixed candidate window set \(\mathcal{W}\) and fixed filtering rules, tercile cutoffs and bin assignments are deterministic.
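
A deterministic tercile assignment can be sketched as follows. The percentile convention (linear interpolation between order statistics) is an assumption made for this example; whichever convention is used must itself be fixed for Proposition 16 to hold.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)                                   # floor, since k >= 0
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def tercile_bins(values):
    """Return (q1, q2, labels) following the case split of Definition 34."""
    ordered = sorted(values)
    q1, q2 = percentile(ordered, 33), percentile(ordered, 66)
    labels = ["L" if x <= q1 else "M" if x <= q2 else "H" for x in values]
    return q1, q2, labels
```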

8.4 Archetype definition (4-axis bins)

Definition 35 (Archetype ID). Each window is assigned an archetype \[a := (\mathrm{G},\mathrm{R},\mathrm{GC},\mathrm{CPG}) \in \{L,M,H\}^4,\] where G is gene-density bin, R is repeat-proxy bin, GC is GC bin, and CPG is CpG bin. The archetype space has \(3^4=81\) combinations.

Definition 36 (Observed archetype set). Let \(\mathcal{A}_{\mathrm{obs}}\) be the subset of archetypes that occur at least once among candidate windows. Note that \(\mathcal{A}_{\mathrm{obs}}\) may be strictly smaller than 81.

Remark 9 (Why unobserved archetypes matter). If an archetype does not occur in \(\mathcal{W}\), it cannot be covered by any selection. Coverage must therefore be evaluated relative to \(\mathcal{A}_{\mathrm{obs}}\), not the full 81 grid.
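
In code, an archetype is simply a 4-tuple of bin labels, and the observed set is the set of tuples that actually occur. The names below are illustrative.

```python
from itertools import product

FULL_GRID = set(product("LMH", repeat=4))   # all 3**4 = 81 archetypes


def archetype(g_bin, r_bin, gc_bin, cpg_bin):
    """Build the archetype tuple (G, R, GC, CPG) and validate its labels."""
    a = (g_bin, r_bin, gc_bin, cpg_bin)
    if a not in FULL_GRID:
        raise ValueError(f"invalid archetype: {a}")
    return a


def observed_archetypes(window_archetypes):
    """Distinct archetypes that occur at least once among candidate windows."""
    return set(window_archetypes)
```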

8.5 Selection algorithm (coverage-first, uniformity-second)

We select a representative panel of \(N\) windows with the following goals:

Coverage-first: select at least one window from each archetype in \(\mathcal{A}_{\mathrm{obs}}\) whenever \(N \ge |\mathcal{A}_{\mathrm{obs}}|\).
Uniformity-second: distribute remaining selections so that counts per covered archetype are as even as possible.
Spacing constraint: enforce a minimum separation \(d_{\min}\) between selected windows on the same chromosome.
Deterministic tie-breaking: if multiple candidates are equivalent, choose the one with lexicographically minimal (chromosome, start) after sorting, or via a seeded deterministic pseudo-random order recorded in parameters.

Proposition 17 (Selection determinism). Given (i) candidate windows with computed bins, (ii) fixed \(N\) and spacing rule, and (iii) a fixed deterministic tie-breaking rule (including seed if used), the selected panel is deterministic.
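
One way to realize the four goals above is a round-based greedy procedure, sketched below. It is not the released selection script: the function name, the candidate dictionary keys (`chrom`, `start`, `archetype`), and the greedy strategy are assumptions, and a greedy pass can miss a coverage that a more careful assignment would achieve under tight spacing.

```python
from collections import Counter, defaultdict


def select_panel(candidates, n, d_min):
    """Pick up to n windows: coverage first, then even counts per archetype."""
    # Lexicographic (chrom, start) order is the deterministic tie-break.
    ordered = sorted(candidates, key=lambda c: (c["chrom"], c["start"]))
    by_arch = defaultdict(list)
    for c in ordered:
        by_arch[c["archetype"]].append(c)

    selected, used, counts = [], set(), Counter()

    def spaced(c):
        # True if c is at least d_min away from every pick on its chromosome.
        return all(s["chrom"] != c["chrom"]
                   or abs(s["start"] - c["start"]) >= d_min for s in selected)

    progress = True
    while len(selected) < n and progress:
        progress = False
        # One round: each archetype gains at most one window, least-filled first.
        for arch in sorted(by_arch, key=lambda a: (counts[a], a)):
            if len(selected) >= n:
                break
            for c in by_arch[arch]:
                key = (c["chrom"], c["start"])
                if key not in used and spaced(c):
                    selected.append(c)
                    used.add(key)
                    counts[arch] += 1
                    progress = True
                    break
    return selected
```

Since every ordering step uses an explicit sort key, the result does not depend on the order in which candidates are supplied.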

8.6 Region extraction and STEP18 job construction

Each selected window becomes a region package:

region.fa: the exact subsequence extracted from the genome FASTA,
region.gtf: transcripts restricted to the region (TSS used for motors),
raw.zip: zip archive containing both files.

A job list (JSONL) is generated, where each job includes: species_id, region_id, raw_zip path, seed, KEY version, region_start_abs, and windowing parameters.
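
A job list of this shape can be emitted as one JSON object per line. In the sketch below, the field names follow the list above, while the region_id naming scheme, the raw_zip directory layout, and the parameter names `window_bp` and `step_bp` are hypothetical.

```python
import json


def job_lines(panel, species_id, seed, key_version, window_bp, step_bp):
    """Yield one JSON line per selected window, with stable key order."""
    for w in panel:
        # Hypothetical naming scheme: chrom_start_end.
        region_id = f"{w['chrom']}_{w['start']}_{w['start'] + window_bp}"
        job = {
            "species_id": species_id,
            "region_id": region_id,
            "raw_zip": f"RAW/{region_id}/raw.zip",   # hypothetical layout
            "seed": seed,
            "key_version": key_version,
            "region_start_abs": w["start"],
            "window_bp": window_bp,
            "step_bp": step_bp,
        }
        yield json.dumps(job, sort_keys=True)
```

Serializing with sorted keys keeps the job file byte-identical across runs, which matters if the job list itself is hashed.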

Proposition 18 (End-to-end determinism (autoscan \(\rightarrow\) runs)). Fix upstream sources (FASTA and annotations), fix autoscan parameters and seed, and fix STEP18 parameters. Then the pipeline produces the same set of raw region zips, the same run_ids, and the same derived outputs.

8.7 Failure modes (autoscan/selection/extraction)

8.7.0.1 Insufficient \(N\) for the observed archetypes.

If \(N < |\mathcal{A}_{\mathrm{obs}}|\), full coverage is impossible. The correct response is to report partial coverage explicitly (not to silently change bins).

8.7.0.2 Unstable cutoffs due to small candidate sets.

If filtering is too strict, the number of candidate windows shrinks and tercile cutoffs become noisy. Mitigation: increase genome coverage, adjust \(v_{\min}\), or increase window size; record the change.

8.7.0.3 Spacing constraint reduces achievable coverage.

On small chromosomes or with large \(d_{\min}\), spacing can prevent selecting one window per archetype. This must be reported as a constraint-driven coverage loss.

8.7.0.4 Annotation mismatch (gene density axis invalid).

If transcript coordinates do not match FASTA, gene density collapses or becomes unreliable. Mitigation: verify provenance (Section 7) and hash exact inputs.

8.7.0.5 Softmask missing (repeat axis invalid).

If FASTA is not soft-masked, repeat proxy provides no discrimination. Mitigation: use soft-masked FASTA or replace repeat proxy by an explicit repeat track; record policy change as LOCK/STATE.

9 Results: archetype coverage (Table 1; Figure 1)

This section records the autoscan binning and panel selection outcomes as fixed evidence. To keep asset management simple, we do not embed external image files in this TeX project. Instead, we render Figure [fig:archetype_grid] directly in TeX as a 9\(\times\)9 occupancy grid (a deterministic view of the selected panel over the 4-axis archetype space).

9.1 Bin cutoffs and marginal balance (Table 1)

Table 3 records the tercile cutoffs \((q_1,q_2)\) for each axis and the marginal balance of the selected panel across L/M/H. This is a compact validation that the representative selection is not overly biased on any single axis.

mm39 autoscan archetype bins and selection balance (Table 1).
Axis	Observable	\(q_{1}\)	\(q_{2}\)	Sel(L)	Sel(M)	Sel(H)	Notes
G	TSS per Mb (refGene transcripts)	10	18.8	25	29	26
R	softmask fraction (repeat proxy; lowercase A/C/G/T)	0.40285	0.473893	26	30	24
GC	GC fraction (A/C/G/T only; N excluded)	0.400741	0.427662	23	33	24
CPG	CpG per kb (count(`CG`)/kb; N excluded)	6.908133	9.018333	24	35	21
Archetype (4-axis)	Archetype = (G_bin, R_bin, GC_bin, CPG_bin), each in (L,M,H)						N_selected=80; candidate_windows=522; archetypes_present=49; archetypes_covered=49/81 (60.5%)
Uniformity	Counts over covered archetypes (nonzero 9x9 cells)						min/median/max=1/2.0/2; entropy_ratio=0.988; gini=0.142; CV=0.295
Parameters	Scan and selection settings						assembly=mm39(UCSC); chroms=chr1–19,X,Y; win=5,000,000; step=5,000,000; valid_frac\(\ge\)0.8; bins=terciles; target_N=80; selection=coverage-first; dup(max2)

9.1.0.1 Coverage note.

In the locked evidence bundle underlying Table 3, the autoscan candidate set contains 49 observed 4-axis archetypes. The selected \(N=80\) panel covers all 49 observed archetypes (100% coverage relative to observed), which corresponds to 49/81 (60.5%) of the full \(3^4\) archetype grid.

9.2 Archetype coverage (how it is computed)

Definition 37 (Coverage over observed archetypes). Let \(\mathcal{A}_{\mathrm{obs}}\) be the set of archetypes observed in the candidate windows, and let \(\mathcal{A}_{\mathrm{sel}}\) be the set of archetypes present in the selected panel. Define the coverage fraction as: \[\mathrm{coverage} := \frac{|\mathcal{A}_{\mathrm{sel}}|}{|\mathcal{A}_{\mathrm{obs}}|}.\]

Remark 10 (Coverage over the full 81 grid). Reporting \(|\mathcal{A}_{\mathrm{sel}}|/81\) is not meaningful if many archetypes are unobserved in the genome under the chosen windowing/filtering scheme. Therefore, we treat coverage relative to \(\mathcal{A}_{\mathrm{obs}}\) as the primary metric.

9.3 Uniformity metrics (optional but recommended)

Coverage alone does not guarantee uniform evidence. To summarize uniformity across covered archetypes, one may report:

min/median/max counts per covered archetype,
normalized entropy of the archetype count distribution,
Gini coefficient of the archetype count distribution.

These metrics are deterministic functions of the selected panel and are suitable for inclusion in validated bundles.
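
The coverage fraction and the uniformity metrics can be computed as sketched below. The exact formulas are assumptions of this example: entropy uses natural logarithms normalized by the log of the number of covered archetypes, the Gini coefficient is the mean absolute difference form, and CV uses the population standard deviation. With \(N=80\) windows over 49 covered archetypes and per-archetype counts between 1 and 2, exactly 31 archetypes hold two windows and 18 hold one (49 + 31 = 80), which is the input used in the usage note.

```python
import math
from statistics import median


def coverage(n_selected_archetypes, n_reference_archetypes):
    """Coverage fraction relative to a chosen reference set size."""
    return n_selected_archetypes / n_reference_archetypes


def uniformity(counts):
    """Summarize how evenly selections are spread over covered archetypes."""
    n, total = len(counts), sum(counts)
    mean = total / n
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    pair_diff = sum(abs(a - b) for a in counts for b in counts)
    variance = sum((c - mean) ** 2 for c in counts) / n   # population variance
    return {
        "min": min(counts), "median": median(counts), "max": max(counts),
        "entropy_ratio": entropy / math.log(n) if n > 1 else 1.0,
        "gini": pair_diff / (2 * n * n * mean),
        "cv": math.sqrt(variance) / mean,
    }
```

Calling `uniformity([1] * 18 + [2] * 31)` gives an entropy ratio, Gini coefficient, and CV that round to 0.988, 0.142, and 0.295, and `coverage(49, 81)` rounds to 60.5%.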

9.4 Figure 1: TeX-native archetype occupancy grid

9.5 Validation statements and failure modes

9.5.0.1 Validation statements.

A result bundle supporting Table 3 is validated if:

the autoscan candidate set and selection parameters are recorded and hashed,
Table 3 is regenerated deterministically from the recorded candidate windows,
the selected panel list is reproducible under identical inputs and seed,
downstream STEP18 runs are validated by MANIFEST and gate tables.

9.5.0.2 Failure modes.

Binning drift: cutoffs change because upstream inputs changed (assembly/annotation drift) or because filtering/windowing changed without being recorded.
Coverage misreporting: reporting coverage over the full 81-cell grid rather than over \(\mathcal{A}_{\mathrm{obs}}\) can mislead; if both figures are reported, state the denominator of each explicitly.
Selection nondeterminism: unrecorded randomness in tie-breaking leads to different panels; fix a deterministic rule or record the seed.
Constraint-driven gaps: spacing constraints can prevent full coverage; report as such.

9.5.0.3 Bottom line.

Table 3 and the TeX-native grid figure (Fig. [fig:archetype_grid]) jointly fix the evidence that the panel selection covers the observed 4-axis archetype space under a finite N budget.

10 Interpretation: stiffness-shell emergence in LOCK–Derive–Gate DNA Interpreter DNA

This section explains what it means (within this framework) for structure to “emerge” from DNA, and why the claim can be supported with deterministic artifacts rather than informal narrative. The key point is that the activity grammar is fixed, so any diversity in dynamics must arise from the arrangement A4.

10.1 From sequence statistics to shells and anchors: a deterministic coarse-graining

KEY converts 1D sequence into a 1D structural signal (stiffness proxy), then coarse-grains it into shells.

Definition 38 (Stiffness-shell coarse-graining). A stiffness-shell coarse-graining is the composed mapping \[S \mapsto \{z_i\}_{i=1}^n \mapsto \text{(labeled runs)} \mapsto \text{shell list},\] where \(z_i\) is a robustly normalized window score and shells are maximal runs of constant tercile label.

Shell boundaries produce anchors with strengths proportional to the discontinuity of shell means.

Definition 39 (Anchor field). The anchor field of a region is the set of positions \(\{p_j\}\) and strengths \(\{\alpha_j\}\) derived from boundaries (plus edges and optional cores).

Proposition 19 (No “extra” degrees of freedom). For fixed KEY parameters, the shell list and anchor field are deterministic functions of the input sequence. No additional degrees of freedom are introduced at the interpretation stage other than those recorded in LOCK/STATE.
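
The last two stages of the mapping in Definition 38, together with boundary anchors, can be sketched as follows. The sketch takes tercile labels and \(z_i\) scores as given, since the robust normalization is not restated here; positions are in window-index units, and the function names and the `scale` factor are illustrative.

```python
def shells_from_labels(labels, z):
    """Group maximal runs of equal labels into shells with their mean score."""
    shells, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            shells.append({
                "shell_id": len(shells),
                "start": start,            # inclusive window index
                "end": i,                  # exclusive window index
                "label": labels[start],
                "mean_z": sum(z[start:i]) / (i - start),
            })
            start = i
    return shells


def boundary_anchors(shells, scale=1.0):
    """One anchor per shell boundary, strength proportional to the mean jump."""
    return [
        {"pos": right["start"],
         "strength": scale * abs(right["mean_z"] - left["mean_z"])}
        for left, right in zip(shells, shells[1:])
    ]
```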

Remark 11 (Why shells are the right level of abstraction here). Shells deliberately discard base-level microstructure and keep only piecewise-homogeneous segments of the proxy signal. This aligns with the design goal: diversity is expressed by arrangement geometry (where anchors are and how strong they are), not by adding activity categories or ad hoc event types.

10.2 Motors and loops as constrained couplings

Motors are extracted from minimal annotation (TSS). Loops connect motors to nearby anchors by a fixed rule.

Definition 40 (Motor–anchor coupling graph). Let \(M\) be the motor set and \(A\) be the anchor set. Loops define a bipartite graph \(G=(M\cup A, E)\) with \(E\subseteq M\times A\).

Proposition 20 (Graph determinism). For fixed A4 construction rules (including \(k\)-nearest anchor selection), the coupling graph \(G\) is deterministic given the input sequence and annotation.
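
A \(k\)-nearest coupling rule can be sketched as below. Field names follow the loops table in Section 13.3; the loop_id format and the tie-break by anchor_id are assumptions introduced so that the example is fully deterministic.

```python
def build_loops(motors, anchors, k):
    """Connect each motor to its k nearest anchors (ties broken by anchor_id)."""
    loops = []
    for m in sorted(motors, key=lambda mo: mo["motor_id"]):
        nearest = sorted(
            anchors,
            key=lambda a: (abs(a["pos"] - m["tss_local"]), a["anchor_id"]),
        )[:k]
        for a in nearest:
            loops.append({
                "loop_id": f"{m['motor_id']}__{a['anchor_id']}",  # assumed format
                "motor_id": m["motor_id"],
                "anchor_id": a["anchor_id"],
                "motor_pos": m["tss_local"],
                "anchor_pos": a["pos"],
                "distance_bp": abs(a["pos"] - m["tss_local"]),
            })
    return loops
```

With an empty motor set the function returns an empty list, which corresponds to the "no loops" edge case.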

Remark 12 (Role of annotation). Annotations do not change shells/anchors (sequence-driven), but they populate motors and therefore loops. This cleanly separates “sequence structure” from “motor activation” under a minimal contract.

10.3 JEVENT as irreversible stabilization (ledger semantics)

A JEVENT is not “a continuous dynamics”; it is an irreversible ledger step. In PCTS_v0_2, JEVENTs lock eligible loops and must dissipate energy.

Definition 41 (Irreversibility constraints). A JEVENT entry is admissible if:

\(\Delta E < 0\) (dissipation),
residual stress proxy does not increase across the event,
the event is reproducible under identical LOCK/A4/seed, or labeled INCONCLUSIVE.

Proposition 21 (Event diversity comes from A4). Fix the activity grammar and fix the PCTS algorithm. Then the event skeleton is a deterministic function of A4 (and seed if used). Therefore, across regions, any diversity in event sequences is attributable to differences in A4 structure (shells/anchors/motors/loops), not to changes in activity categories.

10.4 What “emergence” means here

In this whitepaper, “emergence” is not a metaphysical claim. It is a concrete statement:

Claim 3 (Emergence as structure-induced irreversibility). Given a fixed activity grammar, region-specific A4 structure induces region-specific irreversible event progressions (ledger sequences).

This claim is validated operationally by:

deterministic KEY outputs (shell/anchor maps),
deterministic PCTS outputs (snapshots and ledger),
gate validation (admissibility constraints),
run_id stability and MANIFEST hashing.

10.5 Scale behavior and multiscale KEY

A common concern is whether evidence from short regions generalizes to larger scales. KEY_v0_4 addresses this by introducing a multiscale shelling (fine and coarse tiers) and merging anchors.

Definition 42 (Multiscale representation). A multiscale representation uses two shell hierarchies (fine, coarse) derived from two window sizes, and merges resulting anchors within a tolerance.

Proposition 22 (Scale-robust anchoring (qualitative)). For fixed merging tolerance, multiscale anchoring reduces sensitivity to any single window scale by retaining strong boundaries across scales. This does not make the proxy “more true” biologically; it makes the structural interpreter more stable under scale variation.
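
Merging anchors from two tiers within a tolerance can be illustrated as follows. The merge rule used here (scan by position and keep the stronger anchor whenever two fall within the tolerance) is an assumption for illustration and is not claimed to be the KEY_v0_4 rule.

```python
def merge_anchors(fine, coarse, tol):
    """Merge two anchor lists; anchors within tol collapse to the stronger one."""
    merged = []
    for a in sorted(fine + coarse, key=lambda x: (x["pos"], -x["strength"])):
        # Compare against the anchor currently kept for the running cluster.
        if merged and a["pos"] - merged[-1]["pos"] <= tol:
            if a["strength"] > merged[-1]["strength"]:
                merged[-1] = a
        else:
            merged.append(a)
    return merged
```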

10.6 Why representative panels generalize

The global claim is: any region is interpretable. Evidence cannot show all regions, so we use archetype panels.

Definition 43 (Archetype-based evidence strategy). We treat the 4-axis archetype bins (gene density / repeat proxy / GC / CpG) as a coarse descriptor of the structural factors that drive shell/anchor diversity. A representative panel is an evidence set that covers the observed archetype space under a finite budget \(N\).

Proposition 23 (Generalization within the contract). Given the contract, interpretability does not require that every region appear in the evidence set. It requires that the procedure be defined for every region and that representative evidence demonstrates the procedure across the archetype space that the genome actually exhibits.

10.7 Limits of interpretation (truthfulness constraints)

Remark 13 (What is and is not claimed). This framework claims completeness of an interpreter (a deterministic mapping and a validated ledger), not completeness of biological understanding. The stiffness proxy, the anchoring heuristic, and the event rule are explicit and auditable; their biological fidelity is a separate research question.

Remark 14 (Why this is still a meaningful “100%” claim). “100%” refers to completeness of the mapping under the LOCK–Derive–Gate DNA Interpreter definition: every region produces a structural arrangement and a validated ledger (or an explicit failure label) under LOCKed rules. This is a strong and testable claim, and it is supported by the artifacts defined in Sections 3–8.

11 Validation and failure modes

This section specifies what it means for a run or bundle to be validated, how to interpret failures, and what kinds of failures exist. Validation is a first-class output of the framework, not an optional add-on.

11.1 Validation levels

Definition 44 (Validation levels). We distinguish three validation levels:

Integrity validation: hashes (DATA_LOCK, MANIFEST) match the file system.
Schema validation: required outputs exist and parse under declared schemas.
Admissibility validation: gate constraints hold (e.g., \(\Delta E<0\), pres non-increase).

A run may pass integrity but fail admissibility; such outcomes must be reported explicitly.

11.2 Bundle-level validation

Definition 45 (Validated bundle). A bundle is validated if:

every included run has a DATA_LOCK record,
every included run has a MANIFEST record and hashes match,
required files for KEY and PCTS exist for every run,
gate outcomes are recorded and a bundle-level summary is provided,
the DOI registry has no missing DOI entries (audit passes).

Proposition 24 (Reproducibility from validation). If a bundle is validated and a third party re-runs the pipeline with the identical recorded inputs and parameters, then the third party can reproduce the same run_ids and verify outputs by MANIFEST. This does not guarantee identical timestamps, but it guarantees semantic reproducibility of derived artifacts.
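
Integrity validation against a MANIFEST can be sketched as below. The on-disk MANIFEST format is not restated in this section, so the example assumes it has already been parsed into a mapping from relative path to expected sha256 hex digest; the function name and problem labels are illustrative.

```python
import hashlib
from pathlib import Path


def verify_manifest(root, manifest):
    """Return a sorted list of (relative_path, problem) for a run directory."""
    problems = []
    for rel, expected in sorted(manifest.items()):
        path = Path(root) / rel
        if not path.is_file():
            problems.append((rel, "missing"))
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append((rel, "hash mismatch"))
    return problems
```

An empty result corresponds to a passed integrity gate; any entry should be reported explicitly rather than repaired silently.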

11.3 Gate taxonomy (recommended minimal set)

11.3.0.1 Integrity gates.

DATA_LOCK present and parseable.
MANIFEST present and parseable.
MANIFEST hashes match.

11.3.0.2 Schema completeness gates.

KEY outputs exist: anchors/shells/boundaries and minimal A4 YAML.
PCTS outputs exist: snapshots, ledger, verdict, gate table.
J_LEDGER required fields are present.

11.3.0.3 Admissibility gates.

\(\Delta E<0\) for JEVENT entries.
pres_post \(\le\) pres_pre across JEVENT.
monotone pres across steps (if defined as such).

11.3.0.4 Reproducibility gates.

under identical LOCK/A4/seed, event skeleton is reproducible,
otherwise marked INCONCLUSIVE with diagnostics.

11.4 Interpreting FAIL vs INCONCLUSIVE

Definition 46 (FAIL). FAIL means an integrity, schema, or admissibility constraint is violated. The output cannot be used to support claims that require the violated property.

Definition 47 (INCONCLUSIVE). INCONCLUSIVE means that a check cannot be resolved under available data while preserving auditability. INCONCLUSIVE is not a hidden pass; it is an explicit status that prevents overclaiming.
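
A bundle-level or run-level status can be derived from individual gate rows as sketched below. The precedence (FAIL over INCONCLUSIVE over PASS) and the treatment of an empty gate table as INCONCLUSIVE are assumptions of this example; the `status` key follows the gate table columns in Section 13.4.

```python
def overall_status(gate_rows):
    """Combine per-gate statuses so that INCONCLUSIVE is never a hidden pass."""
    statuses = {row["status"] for row in gate_rows}
    if not statuses:
        return "INCONCLUSIVE"      # nothing was checked, so nothing is proven
    if "FAIL" in statuses:
        return "FAIL"
    if "INCONCLUSIVE" in statuses:
        return "INCONCLUSIVE"
    return "PASS"
```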

11.5 Failure modes by layer

Input/provenance failures

FASTA missing in raw zip (execution error; FAIL).
FASTA/annotation assembly mismatch (motors outside region; INCONCLUSIVE or FAIL depending on claim).
Chromosome naming mismatch (zero motors; likely FAIL for motor-dependent claims).
Softmask missing (repeat proxy axis invalid; FAIL for autoscan evidence if repeat axis is required).

KEY failures

High-N windows (unstable statistics; should be filtered; otherwise INCONCLUSIVE).
Parameter drift (window/step changed without recording; integrity may pass but scientific claim fails).
Boundary collapse (too few anchors due to large window or overly aggressive merging; not a bug but must be reported).

PCTS/ledger failures

\(\Delta E\) violation (\(\Delta E\ge 0\) in any JEVENT; FAIL).
Residual stress increase (pres gate violated; FAIL).
Max-step truncation (not all loops locked; mark as INCONCLUSIVE unless explicitly allowed).
Schema mismatch (ledger fields missing; FAIL).

Packaging failures

Missing MANIFEST (run not validated; treat as incomplete).
MANIFEST mismatch (files changed post-run; FAIL).
INDEX inconsistencies (counts not matching actual outputs; indicates bug; FAIL).

11.6 Parameter sensitivity and what is allowed

Remark 15 (LOCK vs STATE sensitivity). Sensitivity studies are allowed only through STATE parameters that are recorded in DATA_LOCK and therefore alter run_id. Post-hoc changes to LOCKed rules invalidate the run and require a version bump.

Remark 16 (What “robust” means here). Robustness does not mean “unchanged under any parameter”. It means: under recorded parameter changes, outcomes are reproducible and differences are attributable to explicit parameter deltas.
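
To make concrete how a recorded STATE change can alter run_id, the sketch below derives an identifier from a canonical serialization of input hashes, STATE parameters, and seed. This is a hypothetical construction for illustration only; the actual run_id derivation used by the pipeline is not specified in this section.

```python
import hashlib
import json


def derive_run_id(input_hashes, state_params, seed, n_hex=16):
    """Hash a canonical JSON payload; any recorded change yields a new id."""
    payload = json.dumps(
        {"inputs": input_hashes, "state": state_params, "seed": seed},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:n_hex]
```

Sorting keys before hashing means that the identifier depends on the recorded values, not on the order in which they were written.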

11.7 Limits and non-goals (truthfulness constraints)

The framework does not guarantee biological predictive accuracy; it guarantees interpretable outputs under a contract.
The autoscan archetype evidence is a coverage argument, not a proof of biological completeness.
The choice of proxies (GC/CpG/softmask/TSS density) is explicit and auditable; alternatives can be substituted only by explicit versioning.

12 Roadmap

This section describes how the present manuscript and pipeline can be extended while preserving the contract. The roadmap is structured around versioned artifacts and evidence expansion, not around informal narrative.

12.1 Immediate manuscript-completion tasks

Freeze a final TeX asset management strategy (single central figure bundle; this project uses placeholders).
Expand Appendix B into a fully executable command list matching the released bundles.
Produce a “release candidate” validated bundle that corresponds exactly to Table 1 and the final managed figures.

12.2 Evidence expansion (within mouse mm39)

12.2.0.1 Increase panel diversity under fixed binning.

Under the current 4-axis binning, increase \(N\) from 30–100 toward larger panels (e.g., 200) while maintaining spacing constraints, to provide denser evidence coverage.

12.2.0.2 Add scale tiers.

Add 10 Mb, 50 Mb, and 200 Mb panels to stress-test multiscale anchoring and ledger stability.

12.2.0.3 Add alternative annotation sources (optional).

GENCODE provides an alternative reference annotation resource. Such additions must be treated as explicit provenance changes (new inputs; new hashes; new runs), not as silent replacements.

12.3 Cross-organism extension

Apply the same autoscan and panel strategy to additional vertebrates.
Maintain the same contract: LOCKed grammar, deterministic KEY to A4, deterministic PCTS to ledger, validated packaging.
Compare archetype distributions across organisms and quantify which archetypes exist or are absent under comparable windowing.

12.4 Toward organ/context panels (without changing activity grammar)

Organ/context specificity should be introduced without adding new activity types. Instead, it can be introduced by:

defining MOTOR activation policies as explicit STATE inputs (e.g., selecting subsets of TSS),
defining A5 STATE parameters for context-dependent thresholds,
producing separate validated panels per context where necessary.

Remark 17 (Why this is compatible with the core design). The activity grammar remains fixed; context affects only arrangement instantiation (which motors are active) and STATE thresholds, both of which are recorded in DATA_LOCK and therefore reproducible.

12.5 Versioning strategy

Definition 48 (Version bump policy). A version bump is required when:

LOCK rules change (activity grammar, segmentation algorithm, schema),
provenance sources change (assembly/annotation releases),
output schema changes (fields added/removed).

STATE changes do not require a version bump but do require new run_ids and explicit recording.

12.6 Long-term scientific directions (non-binding)

Improve physical interpretation of the stiffness proxy and anchor strength without sacrificing auditability.
Introduce richer, still-auditable observables beyond GC/CpG/softmask (e.g., k-mer spectra, methylation priors if available).
Expand event catalogs (A1) only if justified and still compatible with the 4-activity grammar (i.e., event types expand but activity categories do not).

12.7 Release plan

A minimal publishable release includes:

this manuscript (TeX),
a validated bundle corresponding to the reported tables/figures,
scripts for autoscan, selection, STEP18 execution, and validation,
a DOI registry with no missing DOI entries.

This aligns with the contract baseline.

13 File formats and schemas

This appendix summarizes the minimal file formats used by the pipeline. All formats are versioned; schema changes require version bumps under the contract.

13.1 A5 priors YAML

A5 is a YAML file that separates LOCK values (frozen priors) from STATE values (allowed run-time parameters). A5 files should include:

meta: name, version, created_utc, purpose,
globals: temperature-like constants, barrier thresholds, derived invariants,
domain blocks (e.g., DNA/membrane/oocyte) with tag: LOCK or tag: STATE.

13.2 A4 layout YAML

The minimal A4 YAML contains:

meta: key_version, created_utc, fasta_header, length_bp,
anchors: list of {anchor_id, pos, kind, strength},
motors: list of {motor_id, tss_local, strand},
loops: list of {loop_id, motor_id, anchor_id, motor_pos, anchor_pos, distance_bp},
shells (and in multiscale: , ),
boundaries (or /).

13.3 TSV outputs

KEY emits TSV tables with headers:

shells.tsv: shell_id, start, end, label, mean_z,
boundaries.tsv: left_shell, right_shell, pos, strength,
anchors.tsv: anchor_id, pos, kind, strength,
motors.tsv: motor_id, tss_local, strand,
loops.tsv: loop_id, motor_id, anchor_id, motor_pos, anchor_pos, distance_bp.

13.4 PCTS outputs

PCTS emits:

snapshots.csv: t_step, n_loops, n_locked, pres, mean_distance_bp,
: run_id, created_utc, event_id, event_type, t_step, anchor_ids, motor_ids, loop_ids, trigger_summary, deltaE_diss, pres_pre, pres_post, lock_state_pre, lock_state_post, notes,
: gate_id, status, notes,
verdict.json: pass flag, T_end_steps, n_events, Pi_T_proxy, n_loops, n_locked.

14 Reproducibility cookbook (exact commands)

This appendix is written as an executable checklist. It describes how to reproduce the artifacts used by this whitepaper: A4 outputs, J_LEDGER outputs, and the mm39 autoscan panel evidence (Table 3 and the external archetype grid figure managed separately).

14.1 B.0 What you should obtain at the end

At the end of this cookbook you should have:

A WORK_DIR directory containing:
- RAW_REGIONS/ (per-region raw zips: region.fa+region.gtf),
- runs/ (one subdirectory per run_id with DATA_LOCK, DERIVED, OUTPUT, MANIFEST),
- INDEX.csv (append-only run summaries),
- AUTOSCAN/ (candidate window metrics and selected panel lists).
A validated bundle zip that can be shared for third-party verification: mouse_dna_mm39_autoscan_VALIDATED_bundle.zip (name may vary by release).
A regenerated Table 1 CSV and the corresponding sections/table1.tex.
A DOI registry audit that reports no missing DOI entries.

All steps follow the LOCK\(\rightarrow\)Derive\(\rightarrow\)Gate contract.

14.2 B.1 Environment setup

14.2.0.1 Bundlekit root.

This whitepaper assumes you use the released end-to-end bundlekit, mouse_mm39_autoscan_END2END_bundlekit_v1_0. After extracting the zip, set:

export KIT_DIR=$PWD/mouse_mm39_autoscan_END2END_bundlekit_v1_0
cd "$KIT_DIR"

All script paths below are relative to $KIT_DIR.

14.2.0.2 Minimum tools.

You need:

Python 3.10+,
standard Unix tools: bash, tar, gzip, curl (or wget),
enough disk for mm39 inputs (gigabyte scale) and derived outputs.

14.2.0.3 Python dependencies.

The autoscan/selection scripts rely on common packages for CSV and YAML parsing. A minimal install typically includes pyyaml and, optionally, pandas; the remaining imports should come from the standard library. If your release bundle provides a requirements.txt, install it:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

14.2.0.4 Working directory convention.

Throughout this cookbook we use:

export WORK_DIR=$PWD/WORK_MM39_AUTOSCAN
mkdir -p "$WORK_DIR"

14.3 B.2 Download mm39 inputs (sequence + annotations)

14.3.0.1 Principle.

We cite UCSC as the distribution source for mm39 sequence and refGene/RefSeq-style annotations. We also cite NCBI for assembly identity/provenance.

14.3.0.2 Recommended layout.

Create a data cache directory:

export DATA_DIR=$PWD/DATA_SOURCES
mkdir -p "$DATA_DIR/mm39"

14.3.0.3 Genome sequence (soft-masked).

Download the soft-masked chromosome FASTA archive from UCSC and extract:

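cd "$DATA_DIR/mm39"
# Recommended: use the provided download helper (UCSC).
# The helper lives in the kit, so call it through $KIT_DIR, not relative to this directory.
"$KIT_DIR/scripts/download_ucsc_mm39.sh"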

# If you downloaded a tar.gz manually, extract with:
tar -xzf mm39.chromFa.tar.gz
# expected output: a directory like chromFa/ containing chr*.fa files

If your release bundles a single merged FASTA, use that instead. The key requirement is that repeat masking is present if you use the softmask fraction axis.

14.3.0.4 Gene annotation (refGene/RefSeq).

Download and decompress the refGene-derived GTF/GFF provided by your release (or UCSC tables export). For example:

cd "$DATA_DIR/mm39"
# curl -L -o refGene.gtf.gz <UCSC_URL>/mm39/...
gzip -d refGene.gtf.gz

14.3.0.5 Hash and freeze inputs.

Compute and store sha256 hashes for the downloaded archives and the final extracted files you will actually use:

cd "$DATA_DIR/mm39"
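sha256sum mm39.chromFa.tar.gz > SHA256SUMS.txt
sha256sum refGene.gtf >> SHA256SUMS.txt
# Also record the extracted per-chromosome FASTA files that the autoscan reads
# (adjust the path if your archive extracts to a different directory).
sha256sum chromFa/chr*.fa >> SHA256SUMS.txt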
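Where sha256sum is unavailable, the same digests can be computed with the Python standard library. This is a minimal sketch; the function names are illustrative, and the output line imitates the two-space text-mode format of sha256sum so that the results can be compared against SHA256SUMS.txt.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Hex SHA-256 digest of an iterable of bytes objects."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so multi-gigabyte inputs fit in memory."""
    with open(path, "rb") as fh:
        return sha256_of_chunks(iter(lambda: fh.read(chunk_size), b""))


def sha256sums_line(path):
    """Format one entry as '<hex digest>  <path>'."""
    return f"{sha256_of_file(path)}  {path}"
```

Reading in chunks matters here because the mm39 archives are gigabyte scale; the digest does not depend on the chunk size.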

14.4 B.2.1 One-command end-to-end run (recommended)

If you want the fully scripted end-to-end execution (download \(\rightarrow\) autoscan \(\rightarrow\) selection \(\rightarrow\) extraction \(\rightarrow\) STEP18 \(\rightarrow\) validation), run the provided driver script:

cd "$KIT_DIR"
./scripts/RUN_MM39_AUTOSCAN_END2END_v1_0.sh

This script is the canonical reference for the v1.0 kit. It records parameters, produces artifacts under a work directory, and is intended to be used to reproduce the evidence bundle hashes recorded in the Evidence lock.

14.5 B.2.2 Canonical WORK_DIR tree produced by the end-to-end script

This subsection freezes the expected on-disk artifact layout after running:

./scripts/RUN_MM39_AUTOSCAN_END2END_v1_0.sh

A run is considered structurally complete (schema-level validation) only if the required nodes below exist. Optional nodes are marked (optional).

14.5.0.1 Canonical tree (v1.0).

WORK_DIR/
  AUTOSCAN/
    mm39_windows_metrics.csv
    mm39_windows_archetyped.csv
    archetype_bins.yaml
    selected_regions.csv
    coverage_report.json

  EVIDENCE/
    TABLE1_mm39_archetype_coverage_v0_1.csv
    mm39_selected_regions_archetypeN80_v0_1.csv
    mm39_archetype_coverage_matrix9x9_v0_1.csv
    RUN_METADATA_mm39_archetype_coverage_v0_1.json
    MANIFEST_mm39_archetype_coverage_v0_1.json
    mm39_inputs_sha256_v0_1.txt
    mm39_archetype_coverage_TABLE1_FIG1_bundle_v0_1.zip

  RAW_REGIONS/
    regions/
      <region_id_0001>.zip
      <region_id_0002>.zip
      ...
    raw_regions_index.csv

  JOBS/
    jobs_mm39_panel.jsonl
    progress.csv
    checkpoint.pkl

  runs/
    <run_id_1>/
      DATA_LOCK/
        data_lock.json
      DERIVED/
        KEY_v0_4/
          A4_layout_min.yaml
          anchors.tsv
          motors.tsv
          loops.tsv
          shells_fine.tsv
          shells_coarse.tsv
          boundaries_fine.tsv
          boundaries_coarse.tsv
          KEY_RUN.json
      OUTPUT/
        PCTS_v0_2/
          snapshots.csv
          J_LEDGER.csv
          gate_table.csv
          verdict.json
      MANIFEST.json
    <run_id_2>/
      ...
  INDEX.csv

  mouse_dna_mm39_autoscan_VALIDATED_bundle.zip   (optional)

14.5.0.2 Notes.

The EVIDENCE/ directory is the canonical source for Table 3 and Fig. [fig:archetype_grid]. The zipped evidence bundle in EVIDENCE/ must match the Evidence lock hashes (Section 0).
The RAW_REGIONS/regions/*.zip archives are the direct inputs for STEP18. Each must contain exactly one region FASTA and (optionally) one region GTF.
The per-run directory layout under runs/ is fixed by the STEP18 pipeline implementation (KEY then PCTS then MANIFEST).
mouse_dna_mm39_autoscan_VALIDATED_bundle.zip is optional and only exists if you choose to pack the full work directory for sharing.

Proposition 25 (WORK_DIR tree conformance). For v1.0, if the end-to-end script exits successfully and all jobs complete, then every selected region produces exactly one run directory under runs/, and the required files listed above exist. Missing nodes indicate an incomplete or non-conformant run and must be treated as FAIL/INCONCLUSIVE under the contract.
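The per-run part of Proposition 25 can be tested with a short script. The sketch below takes its list of required nodes from the canonical tree above (v1.0 layout, KEY_v0_4 and PCTS_v0_2); the function names are illustrative, and the check is structural only: it confirms that files exist, not that their contents are valid.

```python
from pathlib import Path

# Per-run nodes copied from the canonical v1.0 tree in this section.
REQUIRED_RUN_NODES = [
    "DATA_LOCK/data_lock.json",
    "DERIVED/KEY_v0_4/A4_layout_min.yaml",
    "DERIVED/KEY_v0_4/anchors.tsv",
    "DERIVED/KEY_v0_4/motors.tsv",
    "DERIVED/KEY_v0_4/loops.tsv",
    "DERIVED/KEY_v0_4/shells_fine.tsv",
    "DERIVED/KEY_v0_4/shells_coarse.tsv",
    "DERIVED/KEY_v0_4/boundaries_fine.tsv",
    "DERIVED/KEY_v0_4/boundaries_coarse.tsv",
    "DERIVED/KEY_v0_4/KEY_RUN.json",
    "OUTPUT/PCTS_v0_2/snapshots.csv",
    "OUTPUT/PCTS_v0_2/J_LEDGER.csv",
    "OUTPUT/PCTS_v0_2/gate_table.csv",
    "OUTPUT/PCTS_v0_2/verdict.json",
    "MANIFEST.json",
]


def list_run_files(run_dir):
    """Relative POSIX paths of all regular files below one run directory."""
    run_dir = Path(run_dir)
    return {p.relative_to(run_dir).as_posix()
            for p in run_dir.rglob("*") if p.is_file()}


def missing_run_nodes(present):
    """Required nodes that are absent from a set of relative paths."""
    return [rel for rel in REQUIRED_RUN_NODES if rel not in present]


def structural_status(present):
    """Label a run as structurally complete or FAIL/INCONCLUSIVE."""
    return "STRUCTURE_OK" if not missing_run_nodes(present) else "FAIL/INCONCLUSIVE"
```

A typical use would be structural_status(list_run_files(run_dir)) for each subdirectory of runs/, with any FAIL/INCONCLUSIVE result recorded rather than repaired by hand.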

14.6 B.3 Autoscan: compute candidate window metrics

14.6.0.1 Window parameters.

Choose:

window length \(W\) (e.g., 5 Mb),
step \(\Delta\) (often equal to \(W\) for non-overlapping tiling),
valid-base fraction threshold \(v_{\min}\) (e.g., 0.8).

14.6.0.2 Run autoscan.

Your release bundle should include an autoscan script (names may vary). A typical invocation:

python scripts/scan_mm39_windows.py \
  --chrom_fa_dir "$DATA_DIR/mm39/chromFa" \
  --gtf "$DATA_DIR/mm39/refGene.gtf" \
  --win_bp 5000000 \
  --step_bp 5000000 \
  --valid_frac_min 0.80 \
  --out_csv "$WORK_DIR/AUTOSCAN/mm39_windows_metrics.csv"

14.6.0.3 Expected outputs.

The output CSV should contain (at minimum) columns:

chrom, start, end, valid_frac,
gc_frac, cpg_per_kb,
softmask_frac,
tss_count, gene_density_per_mb
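To make the sequence-derived columns concrete, the sketch below computes them for one window string. The exact definitions are assumptions made for illustration and may differ from the kit's scanner: valid bases are A/C/G/T in either case, gc_frac and softmask_frac are taken over valid bases, and cpg_per_kb counts CG dinucleotides per 1000 valid bases. The annotation-derived columns (tss_count, gene_density_per_mb) are not covered.

```python
def window_metrics(seq):
    """Sequence-only metrics for one soft-masked window (assumed definitions)."""
    n = len(seq)
    upper = seq.upper()
    valid = sum(1 for base in upper if base in "ACGT")
    if n == 0 or valid == 0:
        return {"valid_frac": 0.0, "gc_frac": 0.0,
                "softmask_frac": 0.0, "cpg_per_kb": 0.0}
    gc = sum(1 for base in upper if base in "GC")
    # Soft-masked (repeat) bases are written in lowercase.
    soft = sum(1 for base in seq if base in "acgt")
    cpg = upper.count("CG")
    return {
        "valid_frac": valid / n,
        "gc_frac": gc / valid,
        "softmask_frac": soft / valid,
        "cpg_per_kb": 1000.0 * cpg / valid,
    }
```

A window would then be kept as a candidate only if valid_frac is at least the chosen \(v_{\min}\).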

14.6.0.4 Sanity checks (recommended gates).

Compute simple checks before selection:

number of candidate windows (\(|\mathcal{W}|\)) is nontrivial,
distributions of GC/CpG/softmask are not degenerate,
gene density axis is not all zero (if using annotations).

If any axis is degenerate, treat the autoscan as INCONCLUSIVE for archetype evidence and fix provenance (Section 7).
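One possible reading of "degenerate" is that an axis takes fewer than two distinct values across all candidate windows. The sketch below implements that reading only; the threshold and the function name are assumptions, and a stricter gate (for example a minimum spread) may be appropriate.

```python
AXES = ("gene_density_per_mb", "softmask_frac", "gc_frac", "cpg_per_kb")


def degenerate_axes(rows, axes=AXES):
    """Axes with fewer than two distinct values over the candidate windows.

    rows is a list of mappings such as those produced by csv.DictReader.
    """
    bad = []
    for axis in axes:
        distinct = {float(row[axis]) for row in rows}
        if len(distinct) < 2:
            bad.append(axis)
    return bad
```

A non-empty result corresponds to the INCONCLUSIVE outcome described above.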

14.7 B.4 Compute tercile cutoffs and assign 4-axis archetypes

14.7.0.1 Compute \(q_1,q_2\) per axis.

From mm39_windows_metrics.csv, compute terciles for: gene density, softmask fraction, GC fraction, CpG density.

A typical invocation (script name may vary):

# Implemented inside the kit's autoscan pipeline; see scripts/RUN_MM39_AUTOSCAN_END2END_v1_0.sh.
# Arguments for this step (prefix with the script name used by your release):
#   --metrics_csv "$WORK_DIR/AUTOSCAN/mm39_windows_metrics.csv"
#   --out_yaml "$WORK_DIR/AUTOSCAN/archetype_bins.yaml"
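The tercile computation itself is small. The sketch below uses statistics.quantiles with n=3 and the inclusive method; the choice of quantile method is an assumption, and whichever method the kit uses must be recorded, because different methods can shift \(q_1,q_2\) slightly and therefore change bin membership.

```python
import statistics

AXES = ("gene_density_per_mb", "softmask_frac", "gc_frac", "cpg_per_kb")


def tercile_cutoffs(values):
    """Return (q1, q2), the two cut points that split values into thirds."""
    q1, q2 = statistics.quantiles(values, n=3, method="inclusive")
    return q1, q2


def compute_bins(rows):
    """Per-axis cutoffs for a list of window rows (e.g. from csv.DictReader)."""
    bins = {}
    for axis in AXES:
        q1, q2 = tercile_cutoffs([float(row[axis]) for row in rows])
        bins[axis] = {"q1": q1, "q2": q2}
    return bins
```

The resulting mapping has the shape one would expect to serialize into archetype_bins.yaml, although the actual YAML layout used by the kit is not specified here.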

14.7.0.2 Assign bins and archetype IDs.

Then assign L/M/H per axis and an archetype tuple per window:

# Implemented inside the kit's autoscan pipeline; see scripts/RUN_MM39_AUTOSCAN_END2END_v1_0.sh.
# Arguments for this step (prefix with the script name used by your release):
#   --metrics_csv "$WORK_DIR/AUTOSCAN/mm39_windows_metrics.csv"
#   --bins_yaml "$WORK_DIR/AUTOSCAN/archetype_bins.yaml"
#   --out_csv "$WORK_DIR/AUTOSCAN/mm39_windows_archetyped.csv"
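Bin assignment can be sketched as follows. The boundary convention (values equal to a cutoff fall into the lower bin) and the four-letter archetype string in the axis order gene density, softmask fraction, GC fraction, CpG density are assumptions for illustration; the kit may encode archetypes differently.

```python
AXES = ("gene_density_per_mb", "softmask_frac", "gc_frac", "cpg_per_kb")


def level(value, q1, q2):
    """Map a value to L, M, or H given tercile cutoffs q1 <= q2."""
    # Assumed convention: ties at a cutoff go to the lower bin.
    if value <= q1:
        return "L"
    if value <= q2:
        return "M"
    return "H"


def archetype_id(row, bins):
    """Four-letter archetype for one window, e.g. 'LMHL'."""
    return "".join(
        level(float(row[axis]), bins[axis]["q1"], bins[axis]["q2"])
        for axis in AXES
    )
```

With three levels on four axes there are at most 81 distinct archetypes, which is what allows a 9\(\times\)9 occupancy grid.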

14.8 B.5 Select a representative panel (N=30–100)

14.8.0.1 Selection parameters.

Choose:

TARGET_N (e.g., 80),
minimum separation \(d_{\min}\) on same chromosome (e.g., 5 Mb),
deterministic tie-breaking policy (recorded; may include a seed).

14.8.0.2 Run selection.

Example:

python scripts/select_representative_windows.py \
  --archetyped_csv "$WORK_DIR/AUTOSCAN/mm39_windows_archetyped.csv" \
  --target_n 80 \
  --min_sep_bp 5000000 \
  --seed 0 \
  --out_csv "$WORK_DIR/AUTOSCAN/selected_regions.csv"
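For orientation, one way to obtain a deterministic, archetype-balanced panel is a seeded round-robin over archetypes with a same-chromosome separation test. This is a sketch under stated assumptions, not the algorithm of select_representative_windows.py: separation is measured start to start, ties are broken by a seeded shuffle, and the function and field names are illustrative.

```python
import random
from collections import defaultdict


def select_panel(windows, target_n, min_sep_bp, seed=0):
    """Greedy round-robin selection over archetypes.

    windows: list of dicts with keys chrom, start, archetype.
    Returns the chosen windows in selection order.
    """
    rng = random.Random(seed)
    by_arch = defaultdict(list)
    # Sort first so the shuffle starts from a canonical order.
    ordered = sorted(windows, key=lambda w: (w["archetype"], w["chrom"], w["start"]))
    for w in ordered:
        by_arch[w["archetype"]].append(w)
    for arch in sorted(by_arch):
        rng.shuffle(by_arch[arch])  # seeded, hence reproducible

    def too_close(w, chosen):
        return any(c["chrom"] == w["chrom"]
                   and abs(c["start"] - w["start"]) < min_sep_bp
                   for c in chosen)

    chosen = []
    progressed = True
    while len(chosen) < target_n and progressed:
        progressed = False
        for arch in sorted(by_arch):
            queue = by_arch[arch]
            while queue:
                w = queue.pop(0)
                if not too_close(w, chosen):
                    chosen.append(w)
                    progressed = True
                    break
            if len(chosen) >= target_n:
                break
    return chosen
```

Taking one window per archetype per pass keeps rare archetypes from being crowded out, and the loop stops early if no admissible window remains, in which case fewer than TARGET_N regions are returned and that shortfall should be reported.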

14.8.0.3 Coverage report (recommended).

Compute coverage relative to observed archetypes:

# Implemented inside the kit's autoscan pipeline; see scripts/RUN_MM39_AUTOSCAN_END2END_v1_0.sh.
# Arguments for this step (prefix with the script name used by your release):
#   --archetyped_csv "$WORK_DIR/AUTOSCAN/mm39_windows_archetyped.csv"
#   --selected_csv "$WORK_DIR/AUTOSCAN/selected_regions.csv"
#   --out_json "$WORK_DIR/AUTOSCAN/coverage_report.json"

If TARGET_N is smaller than the number of observed archetypes, full coverage is impossible; report this explicitly (do not silently change bins).
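The coverage computation and the explicit impossibility flag can be sketched as below. The field names in the returned mapping are illustrative and are not the schema of coverage_report.json.

```python
def coverage_report(observed_archetypes, selected_archetypes, target_n):
    """Coverage of observed archetypes by the selected panel."""
    observed = set(observed_archetypes)
    covered = observed & set(selected_archetypes)
    return {
        "n_observed_archetypes": len(observed),
        "n_covered_archetypes": len(covered),
        "coverage_frac": len(covered) / len(observed) if observed else 0.0,
        "missing_archetypes": sorted(observed - covered),
        # False means full coverage cannot be reached with this TARGET_N.
        "full_coverage_possible": target_n >= len(observed),
    }
```

Recording full_coverage_possible alongside the missing archetypes keeps a small TARGET_N visible in the report instead of hiding it behind redefined bins.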

14.9 B.6 Extract raw regions (region.fa + region.gtf) and pack as raw.zip

For each selected window \((\mathrm{chrom},\mathrm{start},\mathrm{end})\):

Extract the region sequence into region.fa.
Extract annotations overlapping the window into region.gtf.
Zip into raw.zip.

Example driver (script name may vary):

python scripts/extract_regions.py \
  --selected_csv "$WORK_DIR/AUTOSCAN/selected_regions.csv" \
  --chrom_fa_dir "$DATA_DIR/mm39/chromFa" \
  --gtf "$DATA_DIR/mm39/refGene.gtf" \
  --out_dir "$WORK_DIR/RAW_REGIONS"

14.9.0.1 Required properties for validation.

Each raw zip must contain:

exactly one FASTA file (region.fa or region.fasta) with a single sequence,
one GTF/GFF file (region.gtf) or none (if running structure-only).
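These two properties can be checked before any job is submitted. The sketch below assumes that file type is recognized by extension (.fa/.fasta and .gtf/.gff) and that a FASTA record starts with a line beginning with ">"; the function names are illustrative.

```python
import zipfile

FASTA_EXT = (".fa", ".fasta")
ANNOT_EXT = (".gtf", ".gff")


def read_members(zip_path):
    """Map member name to decoded text for every file in a raw zip."""
    with zipfile.ZipFile(zip_path) as zf:
        return {info.filename: zf.read(info).decode("ascii", errors="replace")
                for info in zf.infolist() if not info.is_dir()}


def check_members(members):
    """Return a list of problems for one raw zip, given name -> text."""
    problems = []
    fastas = [n for n in members if n.lower().endswith(FASTA_EXT)]
    annots = [n for n in members if n.lower().endswith(ANNOT_EXT)]
    if len(fastas) != 1:
        problems.append(f"expected exactly one FASTA, found {len(fastas)}")
    else:
        n_records = sum(1 for line in members[fastas[0]].splitlines()
                        if line.startswith(">"))
        if n_records != 1:
            problems.append(f"FASTA has {n_records} sequences, expected 1")
    if len(annots) > 1:
        problems.append(f"expected at most one GTF/GFF, found {len(annots)}")
    return problems
```

Usage would be check_members(read_members(path)) for each archive under RAW_REGIONS; an archive with no annotation file passes, matching the structure-only case.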

14.10 B.7 Run STEP18 (KEY + PCTS) for the panel

14.10.0.1 Generate jobs JSONL.

Create one job per region zip:

python scripts/build_jobs_jsonl.py \
  --raw_regions_dir "$WORK_DIR/RAW_REGIONS" \
  --species_id mouse_mm39 \
  --key_version KEY_v0_4 \
  --seed 0 \
  --work_dir "$WORK_DIR" \
  --out_jsonl "$WORK_DIR/JOBS/jobs_mm39_panel.jsonl"

14.10.0.2 Batch run.

Run the batch runner:

python batch_run.py \
  --jobs_jsonl "$WORK_DIR/JOBS/jobs_mm39_panel.jsonl" \
  --checkpoint "$WORK_DIR/JOBS/checkpoint.pkl" \
  --progress_csv "$WORK_DIR/JOBS/progress.csv" \
  --save_every 1

As runs complete, rows are appended to INDEX.csv under $WORK_DIR (append-only policy).

14.11 B.8 Validate runs and pack a validated bundle

14.11.0.1 Per-run validation (integrity).

For each run directory, verify:

every required node from the canonical tree (B.2.2) exists,
MANIFEST.json exists,
manifest hashes match on re-check.

A release typically includes a validator script. Example:

python validate_bundle.py --work_root "$WORK_DIR" --strict
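The hash re-check can also be done independently of the release validator. The sketch below assumes that the manifest can be reduced to a flat mapping from run-relative path to hex SHA-256 digest; the real MANIFEST.json schema is not specified in this appendix, so the loading step is left out and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def rehash(run_dir, rel_paths):
    """Recompute SHA-256 for each listed file; missing files map to None."""
    digests = {}
    for rel in rel_paths:
        path = Path(run_dir) / rel
        if path.is_file():
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            digests[rel] = None
    return digests


def manifest_mismatches(recorded, recomputed):
    """Paths whose recomputed digest is missing or differs from the record."""
    return sorted(rel for rel, digest in recorded.items()
                  if recomputed.get(rel) != digest)
```

A run passes this check when manifest_mismatches(recorded, rehash(run_dir, recorded)) is empty; any listed path should be treated as FAIL/INCONCLUSIVE rather than fixed by editing the manifest.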

14.11.0.2 Pack runs for sharing.

After validation, pack the full work directory into a zip (excluding raw extracted intermediates if desired):

python pack_runs.py \
  --work_dir "$WORK_DIR" \
  --out_zip "$WORK_DIR/mouse_dna_mm39_autoscan_VALIDATED_bundle.zip"

14.12 B.9 Regenerate Table 1 (CSV and LaTeX)

14.12.0.1 Canonical generator.

The v1.0 kit includes a generator script that can regenerate (i) the Table 1 CSV, (ii) the 9\(\times\)9 archetype occupancy matrix, and (iii) the selected region catalog from recorded autoscan artifacts. Use that script as the canonical source of Table 1 / Fig 1 evidence for the whitepaper.

Table 3 of this whitepaper (named TABLE1 in the kit's artifact file names) is derived from the autoscan bin definitions and the selected panel marginal counts. To regenerate it deterministically:

14.12.0.2 Generate the CSV.

python make_table1_archetype_bins.py \
  --bins_yaml "$WORK_DIR/AUTOSCAN/archetype_bins.yaml" \
  --selected_csv "$WORK_DIR/AUTOSCAN/selected_regions.csv" \
  --out_csv "tables/TABLE1_mm39_archetype_coverage.csv"

14.12.0.3 Generate sections/table1.tex from the CSV.

If your release includes a converter script, use it. Otherwise, a simple Python conversion is sufficient. The conversion must be deterministic and should not perform hidden formatting decisions.
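A conversion that meets these two requirements can be very small. The sketch below copies cell values verbatim (no rounding or reordering), escapes LaTeX special characters one at a time, and emits a plain tabular with left-aligned columns; the column alignment, the \hline placement, and the function names are presentation assumptions, not the format of the released table.

```python
import csv
import io

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "_": r"\_",
    "#": r"\#", "$": r"\$", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape LaTeX specials character by character (no double escaping)."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def csv_to_tabular(csv_text):
    """Convert CSV text (first row = header) into a LaTeX tabular string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV is empty")
    lines = [r"\begin{tabular}{" + "l" * len(rows[0]) + "}", r"\hline"]
    for i, row in enumerate(rows):
        lines.append(" & ".join(tex_escape(cell) for cell in row) + r" \\")
        if i == 0:
            lines.append(r"\hline")  # rule under the header row
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines) + "\n"
```

Because the function depends only on its input text, identical CSV bytes always yield identical TeX, which is what makes the regenerated sections/table1.tex comparable by hash.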

14.13 B.10 Figure management policy (this TeX project embeds no images)

This TeX project intentionally embeds no external figure files. Instead:

Figure [fig:archetype_grid] is a TeX-only placeholder.
The final archetype grid figure should be managed as a separate asset bundle and inserted at release time.

To regenerate the figure externally, use mm39_windows_archetyped.csv and selected_regions.csv (both under AUTOSCAN/) and compute the 9\(\times\)9 flattened occupancy grid described in Section 8.
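A 9\(\times\)9 grid arises naturally from four three-level axes by pairing two axes per grid dimension. The sketch below assumes four-letter L/M/H archetype strings and one particular pairing (the first two letters index the row, the last two the column); Section 8 defines the authoritative layout, so the pairing here is an assumption for illustration.

```python
LEVEL_INDEX = {"L": 0, "M": 1, "H": 2}


def grid_cell(archetype):
    """Map a four-letter archetype such as 'LMHL' to a (row, col) cell."""
    a, b, c, d = (LEVEL_INDEX[ch] for ch in archetype)
    # Assumed flattening: axes 1-2 give the row, axes 3-4 give the column.
    return 3 * a + b, 3 * c + d


def occupancy_grid(archetypes):
    """Count how many windows fall into each cell of the 9x9 grid."""
    grid = [[0] * 9 for _ in range(9)]
    for archetype in archetypes:
        row, col = grid_cell(archetype)
        grid[row][col] += 1
    return grid
```

Computing one grid from all archetyped windows and another from the selected panel gives the two layers that a coverage figure would compare.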

14.14 B.11 DOI audit

A release should include a DOI registry file (e.g., CITATION_REGISTRY.yaml) and a DOI audit script. Run:

python doi_audit.py --registry CITATION_REGISTRY.yaml

A passing audit is required for a validated release (no missing DOI entries).

14.15 B.12 Minimal acceptance checklist (what to check before claiming reproducibility)

Before claiming that results are reproducible, verify:

Input provenance is fixed and hashed (Section 7).
Autoscan metrics CSV exists and is deterministic under re-run.
Archetype bins YAML and selected panel CSV reproduce exactly under identical parameters and seed.
STEP18 runs produce the same run_ids and the same MANIFEST hashes under re-run.
Gate tables PASS (or failures are explicitly labeled) and INDEX reflects the counts.
Table 3 regenerates exactly from recorded artifacts.
DOI audit passes.

If any item fails, treat the claim as FAIL/INCONCLUSIVE, diagnose, and re-run with explicit versioning rather than post-hoc edits.